Abstract

A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). A pseudometric on the space of probability measures can be defined as the distance between distribution embeddings: we denote this as $\gamma_k$, indexed by the kernel function $k$ that defines the inner product in the RKHS.

We present three theoretical properties of $\gamma_k$. First, we consider the question of determining the conditions on the kernel $k$ for which $\gamma_k$ is a metric: such $k$ are denoted characteristic kernels. Unlike pseudometrics, a metric is zero only when two distributions coincide, thus ensuring the RKHS embedding maps all distributions uniquely (i.e., the embedding is injective). While previously published conditions may apply only in restricted circumstances (e.g., on compact domains), and are difficult to check, our conditions are straightforward and intuitive: integrally strictly positive definite kernels are characteristic. Alternatively, if a bounded continuous kernel is translation-invariant on $\mathbb{R}^d$, then it is characteristic if and only if the support of its Fourier transform is the entire $\mathbb{R}^d$. Second, we show that the distance between distributions under $\gamma_k$ results from an interplay between the properties of the kernel and the distributions, by demonstrating that distributions are close in the embedding space when their differences occur at higher frequencies. Third, to understand the nature of the topology induced by $\gamma_k$, we relate $\gamma_k$ to other popular metrics on probability measures, and present conditions on the kernel $k$ under which $\gamma_k$ metrizes the weak topology.

Keywords: Probability metrics, Homogeneity tests, Independence tests, Kernel methods, 
Universal kernels, Characteristic kernels, Hilbertian metric, Weak topology. 



1. Introduction 



The concept of distance between probability measures is a fundamental one and has found many applications in probability theory, information theory and statistics (Rachev, 1991; Rachev and Rüschendorf, 1998; Liese and Vajda, 2006). In statistics, distances between probability measures are used in a variety of applications, including hypothesis tests (homogeneity tests, independence tests, and goodness-of-fit tests), density estimation, Markov chain Monte Carlo, etc. As an example, homogeneity testing, also called the two-sample problem, involves choosing whether to accept or reject a null hypothesis $H_0 : P = Q$ versus the alternative $H_1 : P \neq Q$, using random samples $\{X_j\}_{j=1}^m$ and $\{Y_j\}_{j=1}^n$ drawn i.i.d. from probability distributions $P$ and $Q$ on a topological space $(M, \mathcal{A})$. It is easy to see that solving this problem is equivalent to testing $H_0 : \gamma(P, Q) = 0$ versus $H_1 : \gamma(P, Q) > 0$, where $\gamma$ is a metric (or, more generally, a semi-metric$^1$) on the space of all probability measures defined on $M$. The problems of testing independence and goodness-of-fit can be posed in an analogous form. In non-parametric density estimation, $\gamma(p_n, p_0)$ can be used to study the quality of the density estimate, $p_n$, that is based on the samples $\{X_j\}_{j=1}^n$ drawn i.i.d. from $p_0$. Popular examples for $\gamma$ in these statistical applications include the Kullback-Leibler divergence, the total variation distance and the Hellinger distance (Vajda, 1989), all three of which are specific instances of the generalized $\phi$-divergence (Ali and Silvey, 1966; Csiszár, 1967), as well as the Kolmogorov distance (Lehmann and Romano, 2005, Section 14.2), the Wasserstein distance (del Barrio et al., 1999), etc.



In probability theory, the distance between probability measures is used in studying limit theorems, the popular example being the central limit theorem. Another application is in metrizing the weak convergence of probability measures on a separable metric space, where the Lévy-Prohorov distance (Dudley, 2002, Chapter 11) and dual-bounded Lipschitz distance (also called the Dudley metric) (Dudley, 2002, Chapter 11) are commonly used.

In the present work, we will consider a particular pseudometric$^1$ on probability distributions which is an instance of an integral probability metric (IPM) (Müller, 1997). Denoting by $\mathcal{P}$ the set of all Borel probability measures on $(M, \mathcal{A})$, the IPM between $P \in \mathcal{P}$ and $Q \in \mathcal{P}$ is defined as

$$\gamma_{\mathcal{F}}(P, Q) = \sup_{f \in \mathcal{F}} \left| \int_M f \, dP - \int_M f \, dQ \right|, \qquad (1)$$

where $\mathcal{F}$ is a class of real-valued bounded measurable functions on $M$. In addition to the general application domains discussed earlier for metrics on probabilities, IPMs have been used in proving central limit theorems using Stein's method (Stein, 1972; Barbour and Chen, 2005) and are popular in empirical process theory (van der Vaart and Wellner, 1996). Since most of the applications listed above require $\gamma_{\mathcal{F}}$ to be a metric on $\mathcal{P}$, the choice of $\mathcal{F}$ is critical (note that irrespective of $\mathcal{F}$, $\gamma_{\mathcal{F}}$ is a pseudometric on $\mathcal{P}$). The following are some examples of $\mathcal{F}$ for which $\gamma_{\mathcal{F}}$ is a metric.

1. Given a set $M$, a metric for $M$ is a function $\rho : M \times M \to \mathbb{R}_+$ such that (i) $\forall x$, $\rho(x, x) = 0$, (ii) $\forall x, y$, $\rho(x, y) = \rho(y, x)$, (iii) $\forall x, y, z$, $\rho(x, z) \le \rho(x, y) + \rho(y, z)$, and (iv) $\rho(x, y) = 0 \Rightarrow x = y$. A semi-metric only satisfies (i), (ii) and (iv). A pseudometric only satisfies (i)-(iii) of the properties of a metric. Unlike a metric space $(M, \rho)$, points in a pseudometric space need not be distinguishable: one may have $\rho(x, y) = 0$ for $x \neq y$.

Now, in the two-sample test, though we mentioned that $\gamma$ is a metric/semi-metric, it is sufficient that $\gamma$ satisfies (i) and (iv).

(a) $\mathcal{F} = C_b(M)$, the space of bounded continuous functions on $(M, \rho)$, where $\rho$ is a metric (Shorack, 2000, Chapter 19, Definition 1.1).

(b) $\mathcal{F} = C_{bu}(M)$, the space of bounded $\rho$-uniformly continuous functions on $(M, \rho)$: Portmanteau theorem (Shorack, 2000, Chapter 19, Theorem 1.1).

(c) $\mathcal{F} = \{f : \|f\|_\infty \le 1\} =: \mathcal{F}_{TV}$, where $\|f\|_\infty = \sup_{x \in M} |f(x)|$. $\gamma_{\mathcal{F}}$ is called the total variation distance (Shorack, 2000, Chapter 19, Proposition 2.2), which we denote as $TV$, i.e., $\gamma_{\mathcal{F}_{TV}} =: TV$.

(d) $\mathcal{F} = \{f : \|f\|_L \le 1\} =: \mathcal{F}_W$, where $\|f\|_L := \sup\{|f(x) - f(y)| / \rho(x, y) : x \neq y \text{ in } M\}$. $\|f\|_L$ is the Lipschitz semi-norm of a real-valued function $f$ on $M$ and $\gamma_{\mathcal{F}}$ is called the Kantorovich metric. If $(M, \rho)$ is separable, then $\gamma_{\mathcal{F}}$ equals the Wasserstein distance (Dudley, 2002, Theorem 11.8.2), denoted as $W := \gamma_{\mathcal{F}_W}$.

(e) $\mathcal{F} = \{f : \|f\|_{BL} \le 1\} =: \mathcal{F}_\beta$, where $\|f\|_{BL} := \|f\|_L + \|f\|_\infty$. $\gamma_{\mathcal{F}}$ is called the Dudley metric (Shorack, 2000, Chapter 19, Definition 2.2), denoted as $\beta := \gamma_{\mathcal{F}_\beta}$.

(f) $\mathcal{F} = \{\mathbb{1}_{(-\infty, t]} : t \in \mathbb{R}^d\} =: \mathcal{F}_{KS}$. $\gamma_{\mathcal{F}}$ is called the Kolmogorov distance (Shorack, 2000, Theorem 2.4).

(g) $\mathcal{F} = \{e^{i \langle \omega, \cdot \rangle} : \omega \in \mathbb{R}^d\} =: \mathcal{F}_c$. This choice of $\mathcal{F}$ results in the maximal difference between the characteristic functions of $P$ and $Q$. That $\gamma_{\mathcal{F}_c}$ is a metric on $\mathcal{P}$ follows from the uniqueness theorem for characteristic functions (Dudley, 2002, Theorem 9.5.1).
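To make two of these function classes concrete, the following minimal sketch evaluates the supremum in (1) in closed form for simple cases: the class in (c) for two probability mass functions on a common finite set, and the class in (f) for two one-dimensional empirical measures. The function names are illustrative assumptions, not notation from this paper. Note that with the normalization $\|f\|_\infty \le 1$ used in (c), the total variation distance takes values in $[0, 2]$.

```python
from bisect import bisect_right


def total_variation(p, q):
    """IPM over {f : ||f||_inf <= 1} for two pmfs on the same finite set.

    The supremum in (1) is attained at f = sign(p - q), which gives
    sum_i |p_i - q_i| (no factor 1/2 under this normalization).
    """
    return sum(abs(pi - qi) for pi, qi in zip(p, q))


def kolmogorov_distance(xs, ys):
    """IPM over indicators of (-inf, t] for two 1-D empirical measures.

    This is the largest gap between the two empirical CDFs; since both
    CDFs are step functions, it suffices to check the sample points.
    """
    xs, ys = sorted(xs), sorted(ys)
    gap = 0.0
    for t in xs + ys:
        cdf_x = bisect_right(xs, t) / len(xs)
        cdf_y = bisect_right(ys, t) / len(ys)
        gap = max(gap, abs(cdf_x - cdf_y))
    return gap
```

For instance, two samples with disjoint ranges such as `[0, 1]` and `[2, 3]` are at Kolmogorov distance 1, the largest possible value.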



Recently, Gretton et al. (2007) and Smola et al. (2007) considered $\mathcal{F}$ to be the unit ball in a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ (Aronszajn, 1950), with $k$ as its reproducing kernel (r.k.), i.e., $\mathcal{F} = \{f : \|f\|_{\mathcal{H}} \le 1\} =: \mathcal{F}_k$ (also see Chapter 4 of Berlinet and Thomas-Agnan (2004) and references therein for related work): we denote $\gamma_{\mathcal{F}_k} =: \gamma_k$. While we have seen many possible $\mathcal{F}$ for which $\gamma_{\mathcal{F}}$ is a metric, $\mathcal{F}_k$ has a number of important advantages:

• Estimation of $\gamma_{\mathcal{F}}$: In applications such as hypothesis testing, $P$ and $Q$ are known only through the respective random samples $\{X_j\}_{j=1}^m$ and $\{Y_j\}_{j=1}^n$ drawn i.i.d. from each, and $\gamma_{\mathcal{F}}(P, Q)$ is estimated based on these samples. One approach is to compute $\gamma_{\mathcal{F}}(P, Q)$ using the empirical measures $P_m = \frac{1}{m} \sum_{j=1}^m \delta_{X_j}$ and $Q_n = \frac{1}{n} \sum_{j=1}^n \delta_{Y_j}$, where $\delta_x$ represents a Dirac measure at $x$. It can be shown that choosing $\mathcal{F}$ as $C_b(M)$, $C_{bu}(M)$, $\mathcal{F}_{TV}$ or $\mathcal{F}_c$ results in this approach not yielding consistent estimates of $\gamma_{\mathcal{F}}(P, Q)$ for all $P$ and $Q$ (Devroye and Györfi, 1990). Although choosing $\mathcal{F} = \mathcal{F}_W$ or $\mathcal{F}_\beta$ yields consistent estimates of $\gamma_{\mathcal{F}}(P, Q)$ for all $P$ and $Q$ when $M = \mathbb{R}^d$, the rates of convergence are dependent on $d$ and become slow for large $d$ (Sriperumbudur et al., 2009b). On the other hand, $\gamma_k(P_m, Q_n)$ is a $\sqrt{mn/(m+n)}$-consistent estimator of $\gamma_k(P, Q)$ if $k$ is measurable and bounded, for all $P$ and $Q$. If $k$ is translation invariant on $M = \mathbb{R}^d$, the rate is independent of $d$ (Gretton et al., 2007; Sriperumbudur et al., 2009b), an important property when dealing with high dimensions. Moreover, $\gamma_{\mathcal{F}}$ is not straightforward to compute when $\mathcal{F}$ is $C_b(M)$, $C_{bu}(M)$, $\mathcal{F}_W$ or $\mathcal{F}_\beta$ (Weaver, 1999, Section 2.3): by contrast, $\gamma_k^2(P, Q)$ is simply a sum of expectations of the kernel $k$ (see Theorem 1 and (13)).



• Comparison to $\phi$-divergences: Instead of using $\gamma_k$ in statistical applications, one can also use $\phi$-divergences. However, the estimators of $\phi$-divergences (especially the Kullback-Leibler divergence) exhibit arbitrarily slow rates of convergence depending on the distributions (see Wang et al. (2005); Nguyen et al. (2008) and references therein for details), while, as noted above, $\gamma_k(P_m, Q_n)$ exhibits good convergence behavior.

• Structured domains: Since $\gamma_k$ is dependent only on the kernel (see Theorem 1) and kernels can be defined on arbitrary domains $M$ (Aronszajn, 1950), choosing $\mathcal{F} = \mathcal{F}_k$ provides the flexibility of measuring the distance between probability measures defined on structured domains (Borgwardt et al., 2006) like graphs, strings, etc., unlike $\mathcal{F} = \mathcal{F}_{KS}$ or $\mathcal{F}_c$, which can handle only $M = \mathbb{R}^d$.
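The estimation point above can be illustrated with a minimal sketch of the plug-in estimate $\gamma_k(P_m, Q_n)$: replacing $P$ and $Q$ by their empirical measures turns the sum of kernel expectations into three averages of kernel evaluations. The Gaussian kernel, its bandwidth `sigma`, and the function names are assumptions chosen for illustration; any bounded measurable kernel could be substituted.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel on R^d; points are given as tuples of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd_plugin(xs, ys, kernel=gaussian_kernel):
    """Plug-in estimate gamma_k(P_m, Q_n) from samples xs ~ P and ys ~ Q.

    Uses gamma_k^2 = E k(X, X') + E k(Y, Y') - 2 E k(X, Y), with each
    expectation replaced by an average over the empirical measures.
    """
    m, n = len(xs), len(ys)
    k_xx = sum(kernel(a, b) for a in xs for b in xs) / (m * m)
    k_yy = sum(kernel(a, b) for a in ys for b in ys) / (n * n)
    k_xy = sum(kernel(a, b) for a in xs for b in ys) / (m * n)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(k_xx + k_yy - 2.0 * k_xy, 0.0))
```

The cost is quadratic in the sample sizes, and the same code applies to any domain on which a kernel function can be evaluated, which is the flexibility referred to in the last item above.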

The distance measure $\gamma_k$ has appeared in a wide variety of applications. These include statistical hypothesis testing, of homogeneity (Gretton et al., 2007), independence (Gretton et al., 2008), and conditional independence (Fukumizu et al., 2008), as well as machine learning applications including kernel independent component analysis (Bach and Jordan, 2002; Gretton et al., 2005) and kernel based dimensionality reduction for supervised learning (Fukumizu et al., 2004). In these applications, kernels offer a linear approach to deal with higher order statistics: given the problem of homogeneity testing, for example, differences in higher order moments are encoded as differences in the means of nonlinear features of the variables. To capture all nonlinearities that are relevant to the problem at hand, the embedding RKHS therefore has to be "sufficiently large" that differences in the embeddings correspond to differences of interest in the distributions. Thus, a natural question is how to guarantee that $k$ provides a sufficiently rich RKHS so as to detect any difference in distributions. A second problem is to determine what properties of distributions result in their being proximate or distant in the embedding space. Finally, we would like to compare $\gamma_k$ to the classical integral probability metrics listed earlier, when used to measure convergence of distributions. In the following section, we describe the contributions of the present paper, addressing each of these three questions in turn.



1.1 Contributions 

The contributions in this paper are three-fold and explained in detail below. 
1.1.1 When is $\mathcal{H}$ characteristic?

Recently, Fukumizu et al. (2008) introduced the concept of a characteristic kernel, i.e., a reproducing kernel for which $\gamma_k(P, Q) = 0 \Leftrightarrow P = Q$, $P, Q \in \mathcal{P}$, i.e., $\gamma_k$ is a metric on $\mathcal{P}$. The corresponding RKHS, $\mathcal{H}$, is referred to as a characteristic RKHS. The following are two characterizations for characteristic RKHSs that have already been studied in the literature:



1. When $M$ is compact, Gretton et al. (2007) showed that $\mathcal{H}$ is characteristic if $k$ is universal in the sense of Steinwart (2001, Definition 4), i.e., $\mathcal{H}$ is dense in the Banach space of bounded continuous functions with respect to the supremum norm. Examples of such $\mathcal{H}$ include those induced by the Gaussian and Laplacian kernels on every compact subset of $\mathbb{R}^d$.

2. Fukumizu et al. (2008, 2009a) extended this characterization to non-compact $M$ and showed that $\mathcal{H}$ is characteristic if and only if the direct sum of $\mathcal{H}$ and $\mathbb{R}$ is dense in the Banach space of $r$-integrable (for some $r \ge 1$) functions. Using this characterization, they showed that the RKHSs induced by the Gaussian and Laplacian kernels (supported on the entire $\mathbb{R}^d$) are characteristic.

In the present study, we provide alternative conditions for characteristic RKHSs which address several limitations of the foregoing. First, it can be difficult to verify the conditions of denseness in both of the above characterizations. Second, universality is in any case an overly restrictive condition because universal kernels assume $M$ to be compact, i.e., they induce a metric only on the space of probability measures that are supported on compact $M$. In addition, there are compactly supported kernels which are not universal, e.g., $B_{2n+1}$-splines (Steinwart, 2001), which can be shown to be characteristic.

In Section 3.1, we present the simple characterization that integrally strictly positive definite (pd) kernels (see Section 1.2 for the definition) are characteristic, i.e., the induced RKHS is characteristic (also see Sriperumbudur et al., 2009a, Theorem 4). This condition is more natural (strict pd is a natural property of interest for kernels, unlike the denseness condition) and much easier to understand than the characterizations mentioned above. Examples of integrally strictly pd kernels on $\mathbb{R}^d$ include the Gaussian, Laplacian, inverse multiquadratics, Matérn kernel family, $B_{2n+1}$-splines, etc.

Although the above characterization of integrally strictly pd kernels being characteristic is simple to understand, it is only a sufficient condition and does not provide an answer for kernels that are not integrally strictly pd,$^2$ e.g., a Dirichlet kernel. Therefore, in Section 3.2, we provide an easily checkable condition, after making some assumptions on the kernel. We present a complete characterization of characteristic kernels when the kernel is translation invariant on $\mathbb{R}^d$. We show that a bounded continuous translation invariant kernel on $\mathbb{R}^d$ is characteristic if and only if the support of the Fourier transform of the kernel is the entire $\mathbb{R}^d$. This condition is easy to check compared to the characterizations described above. An earlier version of this result was provided by Sriperumbudur et al. (2008): by comparison, we now present a simpler and more elegant proof. We also show that all compactly supported translation invariant kernels on $\mathbb{R}^d$ are characteristic. Note, however, that the characterization of integral strict positive definiteness in Section 3.1 does not assume $M$ to be $\mathbb{R}^d$ nor $k$ to be translation invariant.

2. It can be shown that integrally strictly pd kernels are strictly pd (see footnote 4). Therefore, examples of kernels that are not integrally strictly pd include those kernels that are not strictly pd.

We extend the result of Section 3.2 to $M$ being a $d$-Torus, i.e., $\mathbb{T}^d = S^1 \times \cdots \times S^1 \equiv [0, 2\pi)^d$, where $S^1$ is a circle. In Section 3.3, we show that a translation invariant kernel on $\mathbb{T}^d$ is characteristic if and only if the Fourier series coefficients of the kernel are positive, i.e., the support of the Fourier spectrum is the entire $\mathbb{Z}^d$. The proof of this result is similar in flavor to the one in Section 3.2. As examples, the Poisson kernel can be shown to be characteristic, while the Dirichlet kernel is not.
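The contrast between the Poisson and Dirichlet kernels can be checked numerically on the circle ($d = 1$). The sketch below is an illustration under stated assumptions rather than the argument of Section 3.3: it assumes the standard forms $D_n(x) = \sum_{|j| \le n} e^{ijx}$ and $P_r(x) = \sum_{j \in \mathbb{Z}} r^{|j|} e^{ijx}$ (normalizations may differ from those used later in the paper), takes $P$ uniform on $[0, 2\pi)$ and $Q$ with density $(1 + \cos(mx))/(2\pi)$, and approximates the double-integral form of $\gamma_k^2$ (see (14) in Section 2) by a Riemann sum on a uniform grid.

```python
import math


def dirichlet_kernel(x, n=2):
    """D_n(x) = sum_{|j| <= n} e^{ijx} = 1 + 2 * sum_{j=1}^{n} cos(jx)."""
    return 1.0 + 2.0 * sum(math.cos(j * x) for j in range(1, n + 1))


def poisson_kernel(x, r=0.5):
    """P_r(x) = sum_j r^{|j|} e^{ijx} = (1 - r^2) / (1 - 2 r cos x + r^2)."""
    return (1.0 - r * r) / (1.0 - 2.0 * r * math.cos(x) + r * r)


def density_difference(x, m=3):
    """(p - q)(x) for p uniform and q(x) = (1 + cos(m x)) / (2 pi)."""
    return -math.cos(m * x) / (2.0 * math.pi)


def gamma_sq_on_circle(psi, grid=64):
    """Riemann sum for the double integral of psi(x - y) (p-q)(x) (p-q)(y)."""
    h = 2.0 * math.pi / grid
    pts = [a * h for a in range(grid)]
    d = [density_difference(x) for x in pts]
    return h * h * sum(
        psi(pts[a] - pts[b]) * d[a] * d[b]
        for a in range(grid)
        for b in range(grid)
    )
```

With $n = 2$ and $m = 3$, the Dirichlet kernel has no Fourier coefficient at the frequency where $P$ and $Q$ differ, so `gamma_sq_on_circle(dirichlet_kernel)` is zero up to rounding even though $P \neq Q$; the Poisson kernel has coefficient $r^3$ there, and the same computation returns approximately $r^3/2$.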

Based on the discussion so far, it is clear that the characteristic property of $k$ can be characterized in many ways. Given these characterizations, we would like to understand the relation between them. For example, we know that if $k$ is universal, then it is characteristic. Is the converse true? Similarly, as we mentioned before, integrally strictly pd kernels are characteristic and are also strictly pd. Then what is the relation between characteristic and strictly pd kernels? In Section 3.4, we address these questions by exploring the relation between these characterizations, which are summarized in Figure 1.



1.1.2 Dissimilar distributions with small $\gamma_k$

As we have seen, the characteristic property of a kernel is critical in distinguishing between distinct probability measures. Suppose, however, that for a given characteristic kernel $k$ and for any $\varepsilon > 0$, there exist $P$ and $Q$, $P \neq Q$, such that $\gamma_k(P, Q) < \varepsilon$. Though $k$ distinguishes between such $P$ and $Q$, it can be difficult to tell the distributions apart in applications (even with characteristic kernels), since $P$ and $Q$ are then replaced with finite samples, and the distance between them may not be statistically significant (Gretton et al., 2007). Therefore, given a characteristic kernel, it is of interest to determine the properties of distributions $P$ and $Q$ that will cause their embeddings to be close. To this end, in Section 4, we show that given a kernel $k$ (see Theorem 19 for conditions on the kernel), for any $\varepsilon > 0$, there exists $P \neq Q$ (with non-trivial differences between them) such that $\gamma_k(P, Q) < \varepsilon$. These distributions are constructed so as to differ at a sufficiently high frequency, which is then penalized by the RKHS norm when computing $\gamma_k$.
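A one-dimensional sketch of this effect is given below; it is an illustrative construction and not necessarily the one used in Section 4. It assumes $P = N(0, 1)$ with density $p$, $Q$ with density $p(x)(1 + c \sin(\nu x))$ for $|c| \le 1$, and a Gaussian kernel with bandwidth $\sigma$, whose spectral measure $\Lambda$ in Bochner's theorem is $N(0, \sigma^{-2})$. Under these assumptions $|\phi_P(\omega) - \phi_Q(\omega)|^2 = \frac{c^2}{4}\left(e^{-(\omega+\nu)^2/2} - e^{-(\omega-\nu)^2/2}\right)^2$, and $\gamma_k^2$ is obtained by integrating this against $\Lambda$, as in Corollary 4(i) of Section 2. The function name and the integration parameters are hypothetical.

```python
import math


def gamma_k_sq(nu, c=1.0, sigma=1.0, half_width=40.0, steps=40001):
    """gamma_k^2 between N(0,1) and the perturbed density p(x)(1 + c sin(nu x)).

    Integrates |phi_P - phi_Q|^2 against the N(0, sigma^-2) spectral
    measure of the Gaussian kernel using the trapezoidal rule.
    """
    h = 2.0 * half_width / (steps - 1)
    total = 0.0
    for i in range(steps):
        w = -half_width + i * h
        diff_sq = (c * c / 4.0) * (
            math.exp(-((w + nu) ** 2) / 2.0) - math.exp(-((w - nu) ** 2) / 2.0)
        ) ** 2
        spectral_density = (
            sigma / math.sqrt(2.0 * math.pi) * math.exp(-((sigma * w) ** 2) / 2.0)
        )
        weight = 0.5 if i in (0, steps - 1) else 1.0
        total += weight * diff_sq * spectral_density * h
    return total
```

Evaluating `gamma_k_sq` at increasing frequencies such as $\nu = 1, 4, 8$ gives rapidly decreasing values, while the perturbation $c\,p(x)\sin(\nu x)$ itself does not shrink in amplitude: the difference between $P$ and $Q$ simply moves to frequencies where the spectral measure of the kernel has little mass.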



1.1.3 When does $\gamma_k$ metrize the weak topology on $\mathcal{P}$?

Given $\gamma_k$, which is a metric on $\mathcal{P}$, a natural question of theoretical and practical importance to ask is "how is $\gamma_k$ related to other probability metrics, such as the Dudley metric ($\beta$), Wasserstein distance ($W$), total variation metric ($TV$), etc.?" For example, in applications like density estimation, wherein the unknown density is estimated based on finite samples drawn i.i.d. from it, the quality of the estimate is measured by computing the distance between the true density and the estimated density. In such a setting, given two probability metrics, $\rho_1$ and $\rho_2$, one might want to use the stronger$^3$ of the two to determine this distance, as the convergence of the estimated density to the true density in the stronger metric implies the convergence in the weaker metric, while the converse is not true. On the other hand, one might need to use a metric of weaker topology (i.e., coarser topology) to show convergence of some estimators, as the convergence might not occur w.r.t. a metric of strong topology. Clarifying and comparing the topology of a metric on the probabilities is, thus, important in the analysis of density estimation. Based on this motivation, in Section 5, we analyze the relation between $\gamma_k$ and other probability metrics, and show that $\gamma_k$ is weaker than all these other metrics.

3. Two metrics $\rho_1 : Y \times Y \to \mathbb{R}_+$ and $\rho_2 : Y \times Y \to \mathbb{R}_+$ are said to be equivalent if $\rho_1(x, y) = 0 \Leftrightarrow \rho_2(x, y) = 0$, $\forall x, y \in Y$. On the other hand, $\rho_1$ is said to be stronger than $\rho_2$ if $\rho_1(x, y) = 0 \Rightarrow \rho_2(x, y) = 0$, $\forall x, y \in Y$ but not vice-versa. If $\rho_1$ is stronger than $\rho_2$, then we say $\rho_2$ is weaker than $\rho_1$. Note that if $\rho_1$ is stronger (resp. weaker) than $\rho_2$, then the topology induced by $\rho_1$ is finer (resp. coarser) than the one induced by $\rho_2$.

It is well known in probability theory that $\beta$ is weaker than $W$ and $TV$, and it metrizes the weak topology (we will provide formal definitions in Section 5) on $\mathcal{P}$ (Shorack, 2000; Gibbs and Su, 2002). Since $\gamma_k$ is weaker than all these other probability metrics, i.e., the topology induced by $\gamma_k$ is coarser than the one induced by these metrics, the next interesting question to answer would be, "When does $\gamma_k$ metrize the weak topology on $\mathcal{P}$?" In other words, for what $k$ does the topology induced by $\gamma_k$ coincide with the weak topology? Answering this question would show that $\gamma_k$ is equivalent to $\beta$, while it is weaker than $W$ and $TV$. In probability theory, the metrization of weak topology is of prime importance in proving results related to the weak convergence of probability measures. Therefore, knowing the answer to the above question will help in using $\gamma_k$ as a theoretical tool in probability theory. To this end, in Section 5, we show that universal kernels on compact $(M, \rho)$ metrize the weak topology on $\mathcal{P}$. For the non-compact setting, we assume $M = \mathbb{R}^d$ and provide sufficient conditions on the kernel such that $\gamma_k$ metrizes the weak topology on $\mathcal{P}$.

In the following section, we introduce the notation and some definitions that are used 
throughout the paper. Supplementary results used in proofs are collected in Appendix A. 

1.2 Definitions and notation 

For $M \subset \mathbb{R}^d$ and $\mu$ a Borel measure on $M$, $L^r(M, \mu)$ denotes the Banach space of $r$-power ($r \ge 1$) $\mu$-integrable functions. We will also use $L^r(M)$ for $L^r(M, \mu)$ and $dx$ for $d\mu(x)$ if $\mu$ is the Lebesgue measure on $M$. $C_b(M)$ denotes the space of all bounded, continuous functions on $M$. The space of all $r$-continuously differentiable functions on $M$ is denoted by $C^r(M)$, $0 \le r \le \infty$. For $x \in \mathbb{C}$, $\bar{x}$ represents the complex conjugate of $x$. We denote as $i$ the imaginary unit $\sqrt{-1}$.

For a measurable function $f$ and a signed measure $P$, $Pf := \int f \, dP = \int_M f(x) \, dP(x)$. $\delta_x$ represents the Dirac measure at $x$. The symbol $\delta$ is overloaded to represent the Dirac measure, the Dirac-delta distribution, and the Kronecker-delta, which should be distinguishable from the context. For $M = \mathbb{R}^d$, the characteristic function, $\phi_P$, of $P \in \mathcal{P}$ is defined as $\phi_P(\omega) := \int_{\mathbb{R}^d} e^{i \omega^T x} \, dP(x)$, $\omega \in \mathbb{R}^d$.

Vanishing at infinity and $C_0(M)$: A complex function $f$ on a locally compact Hausdorff space $M$ is said to vanish at infinity if for every $\varepsilon > 0$ there exists a compact set $K \subset M$ such that $|f(x)| < \varepsilon$ for all $x \notin K$. The class of all continuous $f$ on $M$ which vanish at infinity is denoted as $C_0(M)$.

Holomorphic and entire functions: Let $D \subset \mathbb{C}^d$ be an open subset and $f : D \to \mathbb{C}$ be a function. $f$ is said to be holomorphic at the point $z_0 \in D$ if

$$f'(z_0) := \lim_{z \to z_0} \frac{f(z_0) - f(z)}{z_0 - z} \qquad (2)$$

exists. Moreover, $f$ is called holomorphic if it is holomorphic at every $z_0 \in D$. $f$ is called an entire function if $f$ is holomorphic and $D = \mathbb{C}^d$.
Positive definite and strictly positive definite: A function $k : M \times M \to \mathbb{R}$ is called positive definite (pd) if, for all $n \in \mathbb{N}$, $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$ and all $x_1, \ldots, x_n \in M$, we have

$$\sum_{i,j=1}^n \alpha_i \alpha_j k(x_i, x_j) \ge 0. \qquad (3)$$

Furthermore, $k$ is said to be strictly pd if, for mutually distinct $x_1, \ldots, x_n \in M$, equality in (3) only holds for $\alpha_1 = \cdots = \alpha_n = 0$. $\psi$ is said to be a positive definite function on $\mathbb{R}^d$ if $k(x, y) = \psi(x - y)$ is positive definite.
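Condition (3) can be probed numerically by evaluating the quadratic form for particular points and coefficients. The sketch below is only a spot check under illustrative assumptions: a finite number of evaluations can refute positive definiteness but cannot prove it. The one-dimensional Gaussian kernel is used as the pd example, and the symmetric function $|x - y|$ is included purely as a hypothetical non-example.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """One-dimensional Gaussian kernel, a standard strictly pd example."""
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))


def quadratic_form(kernel, points, alphas):
    """Left-hand side of (3): sum_{i,j} alpha_i alpha_j k(x_i, x_j)."""
    return sum(
        a_i * a_j * kernel(x_i, x_j)
        for a_i, x_i in zip(alphas, points)
        for a_j, x_j in zip(alphas, points)
    )
```

For the non-example, the points $0, 1$ with coefficients $1, -1$ give a quadratic form of $-2$, which violates (3), whereas the Gaussian kernel yields a strictly positive value for distinct points and non-zero coefficients.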

Integrally strictly positive definite: Let $M$ be a topological space. A measurable and bounded kernel, $k$, is said to be integrally strictly positive definite if

$$\iint_M k(x, y) \, d\mu(x) \, d\mu(y) > 0 \qquad (4)$$

for all finite non-zero signed Borel measures, $\mu$, defined on $M$.

The above definition is a generalization of integrally strictly positive definite functions (Stewart, 1976, Section 6): $\iint_M k(x, y) f(x) f(y) \, dx \, dy > 0$ for all $f \in L^2(\mathbb{R}^d)$, which is the strictly positive definiteness of the integral operator given by the kernel. Note that the above definition is not equivalent to the definition of strictly pd kernels: if $k$ is integrally strictly pd, then it is strictly pd, while the converse is not true.$^4$

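Fourier transform in $\mathbb{R}^d$: For $f \in L^1(\mathbb{R}^d)$, $\hat{f}$ and $f^\vee$ represent the Fourier transform and inverse Fourier transform of $f$ respectively, defined as

$$\hat{f}(y) := \frac{1}{(2\pi)^{d/2}} \int_{\mathbb{R}^d} e^{-i y^T x} f(x) \, dx, \quad y \in \mathbb{R}^d, \qquad (5)$$

$$f^\vee(x) := \frac{1}{(2\pi)^{d/2}} \int_{\mathbb{R}^d} e^{i x^T y} f(y) \, dy, \quad x \in \mathbb{R}^d. \qquad (6)$$

Convolution: If $f$ and $g$ are complex functions in $\mathbb{R}^d$, their convolution $f * g$ is defined by

$$(f * g)(x) := \int_{\mathbb{R}^d} f(y) g(x - y) \, dy, \qquad (7)$$

provided that the integral exists for almost all $x \in \mathbb{R}^d$, in the Lebesgue sense. Let $\mu$ be a finite Borel measure on $\mathbb{R}^d$ and $f$ be a bounded measurable function on $\mathbb{R}^d$. The convolution of $f$ and $\mu$, $f * \mu$, which is a bounded measurable function, is defined by

$$(f * \mu)(x) := \int_{\mathbb{R}^d} f(x - y) \, d\mu(y). \qquad (8)$$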



4. Suppose $k$ is not strictly pd. This means that for some $n \in \mathbb{N}$ and some mutually distinct $x_1, \ldots, x_n \in M$, there exist $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$ with $\alpha_j \neq 0$ for some $j \in \{1, \ldots, n\}$ such that $\sum_{j,l=1}^n \alpha_j \alpha_l k(x_j, x_l) = 0$. By defining $\mu = \sum_{j=1}^n \alpha_j \delta_{x_j}$, it is easy to see that there exists $\mu \neq 0$ such that $\iint_M k(x, y) \, d\mu(x) \, d\mu(y) = 0$, which means $k$ is not integrally strictly pd. Therefore, if $k$ is integrally strictly pd, then it is strictly pd. However, the converse is not true. See Steinwart and Christmann (2008, Proposition 4.60, Theorem 4.62) for an example.
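Rapidly decaying functions, $\mathscr{D}_d$ and $\mathscr{S}_d$: Let $\mathscr{D}_d$ be the space of compactly supported infinitely differentiable functions on $\mathbb{R}^d$, i.e., $\mathscr{D}_d = \{f \in C^\infty(\mathbb{R}^d) \mid \mathrm{supp}(f) \text{ is bounded}\}$, where $\mathrm{supp}(f) = \mathrm{cl}(\{x \in \mathbb{R}^d \mid f(x) \neq 0\})$. A function $f : \mathbb{R}^d \to \mathbb{C}$ is said to decay rapidly, or be rapidly decreasing, if for all $N \in \mathbb{N}$,

$$\sup_{\|\alpha\|_1 \le N} \, \sup_{x \in \mathbb{R}^d} \, (1 + \|x\|_2^2)^N \, |(T_\alpha f)(x)| < \infty, \qquad (9)$$

where $\alpha = (\alpha_1, \ldots, \alpha_d)$ is an ordered $d$-tuple of non-negative $\alpha_j$, $\|\alpha\|_1 = \sum_{j=1}^d \alpha_j$ and $T_\alpha = \left(\frac{1}{i} \frac{\partial}{\partial x_1}\right)^{\alpha_1} \cdots \left(\frac{1}{i} \frac{\partial}{\partial x_d}\right)^{\alpha_d}$. $\mathscr{S}_d$, called the Schwartz class, denotes the vector space of rapidly decreasing functions. Note that $\mathscr{D}_d \subset \mathscr{S}_d$. It can be shown that for any $f \in \mathscr{S}_d$, $\hat{f} \in \mathscr{S}_d$ and $f^\vee \in \mathscr{S}_d$ (see Folland (1999, Chapter 9) and Rudin (1991, Chapter 6) for details).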

Distributions, tempered distributions, and $\mathscr{S}'_d$: A linear functional on $\mathscr{D}_d$ which is continuous with respect to the Fréchet topology (see Rudin, 1991, Definition 6.3) is called a distribution in $\mathbb{R}^d$. The space of all distributions in $\mathbb{R}^d$ is denoted by $\mathscr{D}'_d$. A linear continuous functional over the space $\mathscr{S}_d$ is called a tempered distribution and the space of all tempered distributions in $\mathbb{R}^d$ is denoted by $\mathscr{S}'_d$.

Support of a distribution: For an open set $U \subset \mathbb{R}^d$, $\mathscr{D}_d(U)$ denotes the subspace of $\mathscr{D}_d$ consisting of the functions with support contained in $U$. Suppose $D \in \mathscr{D}'_d$. If $U$ is an open set of $\mathbb{R}^d$ and if $D(\varphi) = 0$ for every $\varphi \in \mathscr{D}_d(U)$, then $D$ is said to vanish or be null in $U$. Let $W$ be the union of all open $U \subset \mathbb{R}^d$ in which $D$ vanishes. The complement of $W$ is the support of $D$.

For complete details on distribution theory and Fourier transforms of distributions, we refer the reader to Folland (1999, Chapter 9) and Rudin (1991, Chapter 6).



2. Hilbert Space Embedding of Probability Measures 

We previously mentioned that $\gamma_k$ is related to the theory of RKHS embedding of probability measures described in Gretton et al. (2007); Smola et al. (2007), and originally introduced and studied in the late 70's and early 80's (see Berlinet and Thomas-Agnan (2004, Chapter 4) and references therein for details). The following result shows how such an embedding can be obtained through an alternative representation for $\gamma_k$.



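Theorem 1 Let $\mathcal{P}_k := \{P \in \mathcal{P} : \int_M \sqrt{k(x, x)} \, dP(x) < \infty\}$, where $k$ is measurable on $M$. Then for any $P, Q \in \mathcal{P}_k$,

$$\gamma_k(P, Q) = \left\| \int_M k(\cdot, x) \, dP(x) - \int_M k(\cdot, x) \, dQ(x) \right\|_{\mathcal{H}} =: \|Pk - Qk\|_{\mathcal{H}}, \qquad (10)$$

where $\mathcal{H}$ is the RKHS generated by $k$.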



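Proof Let $T_P : \mathcal{H} \to \mathbb{R}$ be the linear functional defined as $T_P[f] := \int_M f(x) \, dP(x)$ with $\|T_P\| := \sup_{f \in \mathcal{H}, f \neq 0} \frac{|T_P[f]|}{\|f\|_{\mathcal{H}}}$. Consider

$$|T_P[f]| = \left| \int_M f(x) \, dP(x) \right| \le \int_M |f(x)| \, dP(x) = \int_M |\langle f, k(\cdot, x) \rangle_{\mathcal{H}}| \, dP(x) \le \int_M \sqrt{k(x, x)} \, \|f\|_{\mathcal{H}} \, dP(x),$$

which implies $\|T_P\| < \infty$, $\forall P \in \mathcal{P}_k$, i.e., $T_P$ is a bounded linear functional on $\mathcal{H}$. Therefore, by the Riesz representation theorem (Reed and Simon, 1972, Theorem II.4), for each $P \in \mathcal{P}_k$, there exists a unique $\lambda_P \in \mathcal{H}$ such that $T_P[f] = \langle f, \lambda_P \rangle_{\mathcal{H}}$, $\forall f \in \mathcal{H}$. Let $f = k(\cdot, u)$ for some $u \in M$. Then, $T_P[k(\cdot, u)] = \langle k(\cdot, u), \lambda_P \rangle_{\mathcal{H}} = \lambda_P(u)$, which implies $\lambda_P = \int_M k(\cdot, x) \, dP(x) =: Pk$. Therefore, with

$$|Pf - Qf| = |T_P[f] - T_Q[f]| = |\langle f, \lambda_P \rangle_{\mathcal{H}} - \langle f, \lambda_Q \rangle_{\mathcal{H}}| = |\langle f, \lambda_P - \lambda_Q \rangle_{\mathcal{H}}|,$$

we have

$$\gamma_k(P, Q) = \sup_{\|f\|_{\mathcal{H}} \le 1} |Pf - Qf| = \|\lambda_P - \lambda_Q\|_{\mathcal{H}} = \|Pk - Qk\|_{\mathcal{H}}.$$

Note that this holds for any $P, Q \in \mathcal{P}_k$. ■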



Given a kernel, $k$, (10) holds for all $P \in \mathcal{P}_k$. However, in practice, especially in statistical inference applications, it is not possible to check whether $P \in \mathcal{P}_k$ as $P$ is not known. Therefore, one would prefer to have a kernel such that

$$\int_M \sqrt{k(x, x)} \, dP(x) < \infty, \quad \forall P \in \mathcal{P}. \qquad (11)$$

The following proposition shows that (11) is equivalent to the kernel being bounded. Therefore, combining Theorem 1 and Proposition 2 shows that if $k$ is measurable and bounded, then $\gamma_k(P, Q) = \|Pk - Qk\|_{\mathcal{H}}$ for any $P, Q \in \mathcal{P}$.



Proposition 2 Let $f$ be a measurable function on $M$. Then $\int_M f(x) \, dP(x) < \infty$ for all $P \in \mathcal{P}$ if and only if $f$ is bounded.

Proof One direction is straightforward because if $f$ is bounded, then $\int_M f(x) \, dP(x) < \infty$ for all $P \in \mathcal{P}$. Let us consider the other direction. Suppose $f$ is not bounded. Then there exists a sequence $\{x_n\} \subset M$ such that $f(x_n) \to \infty$ as $n \to \infty$. By taking a subsequence, if necessary, we can assume $f(x_n) > n^2$ for all $n$. Then, $A := \sum_{n=1}^\infty \frac{1}{f(x_n)} < \infty$. Define a probability measure $P$ on $M$ by $P = \frac{1}{A} \sum_{n=1}^\infty \frac{1}{f(x_n)} \delta_{x_n}$, where $\delta_{x_n}$ is a Dirac measure at $x_n$. Then, $\int_M f(x) \, dP(x) = \frac{1}{A} \sum_{n=1}^\infty \frac{f(x_n)}{f(x_n)} = \infty$, which means if $f$ is not bounded, then there exists a $P \in \mathcal{P}$ such that $\int_M f(x) \, dP(x) = \infty$. ■
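The construction in the proof can be traced numerically in a boundary case. The sketch below assumes, purely for illustration, that $f(x_n) = n^2$ exactly (the proof only requires $f(x_n) > n^2$), so that $A = \sum_n n^{-2} = \pi^2/6$ and the atom at $x_n$ carries mass $1/(A n^2)$. The function names are hypothetical.

```python
import math

# Normalizing constant A = sum_{n >= 1} 1 / n^2 = pi^2 / 6 (Basel sum).
A = math.pi ** 2 / 6.0


def atom_mass(n):
    """Mass that P places on the atom x_n when f(x_n) = n^2."""
    return 1.0 / (A * n * n)


def truncated_integral(num_atoms):
    """Contribution of the first num_atoms atoms to the integral of f dP.

    Each atom contributes f(x_n) * atom_mass(n) = 1 / A, so the partial
    sums grow linearly and the full integral diverges.
    """
    return sum(n * n * atom_mass(n) for n in range(1, num_atoms + 1))
```

The masses sum to one, so $P$ is a probability measure, yet `truncated_integral(N)` equals $N/A$ and grows without bound as more atoms are included.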

The representation of $\gamma_k$ in (10) yields the embedding,

$$\Pi : \mathcal{P} \to \mathcal{H}, \qquad P \mapsto \int_M k(\cdot, x) \, dP(x), \qquad (12)$$

as proposed by Berlinet and Thomas-Agnan (2004, Chapter 4, Section 1.1) and Smola et al. (2007). Berlinet and Thomas-Agnan (2004) derived this embedding as a generalization of $\delta_x \mapsto k(\cdot, x)$ (see Chapter 4 of Berlinet and Thomas-Agnan (2004) for details), while Gretton et al. (2007) arrived at this embedding by choosing $\mathcal{F} = \mathcal{F}_k$ in (1). Since $\gamma_k(P, Q) = \|\Pi[P] - \Pi[Q]\|_{\mathcal{H}}$, the question "When is $\gamma_k$ a metric on $\mathcal{P}$?" is equivalent to the question "When is $\Pi$ injective?". Addressing these questions is the central focus of the paper and is discussed in Section 3.

Before proceeding further, we present some other, equivalent representations of $\gamma_k$ which will not only improve our understanding of $\gamma_k$, but also be helpful in its computation. First, note that by exploiting the reproducing property of $k$, $\gamma_k$ can be equivalently represented as

$$
\begin{aligned}
\gamma_k^2(P, Q) &= \left\| \int_M k(\cdot, x) \, dP(x) - \int_M k(\cdot, x) \, dQ(x) \right\|_{\mathcal{H}}^2 \\
&= \left\langle \int_M k(\cdot, x) \, dP(x) - \int_M k(\cdot, x) \, dQ(x), \; \int_M k(\cdot, y) \, dP(y) - \int_M k(\cdot, y) \, dQ(y) \right\rangle_{\mathcal{H}} \\
&= \left\langle \int_M k(\cdot, x) \, dP(x), \int_M k(\cdot, y) \, dP(y) \right\rangle_{\mathcal{H}} + \left\langle \int_M k(\cdot, x) \, dQ(x), \int_M k(\cdot, y) \, dQ(y) \right\rangle_{\mathcal{H}} \\
&\quad - 2 \left\langle \int_M k(\cdot, x) \, dP(x), \int_M k(\cdot, y) \, dQ(y) \right\rangle_{\mathcal{H}} \\
&\overset{(a)}{=} \iint_M k(x, y) \, dP(x) \, dP(y) + \iint_M k(x, y) \, dQ(x) \, dQ(y) - 2 \iint_M k(x, y) \, dP(x) \, dQ(y) \qquad (13) \\
&= \iint_M k(x, y) \, d(P - Q)(x) \, d(P - Q)(y), \qquad (14)
\end{aligned}
$$

where (a) follows from the fact that $\int_M f(x) \, dP(x) = \langle f, \int_M k(\cdot, x) \, dP(x) \rangle_{\mathcal{H}}$ for all $f \in \mathcal{H}$, $P \in \mathcal{P}_k$ (see proof of Theorem 1), applied with $f = \int_M k(\cdot, y) \, dP(y)$. As motivated in Section 1, $\gamma_k^2$ is a straightforward sum of expectations of $k$, and can be computed easily, e.g., using (13) either in closed form or using numerical integration techniques, depending on the choice of $k$, $P$ and $Q$. It is easy to show that, if $k$ is a Gaussian kernel with $P$ and $Q$ being normal distributions on $\mathbb{R}^d$, then $\gamma_k$ can be computed in a closed form (see Sriperumbudur et al. (2009b, Section III-C) for examples). In the following corollary to Theorem 1, we prove three results which provide a nice interpretation for $\gamma_k$ when $M = \mathbb{R}^d$ and $k$ is translation invariant, i.e., $k(x, y) = \psi(x - y)$, where $\psi$ is a positive definite function. We provide a detailed explanation for Corollary 4 in Remark 5. Before stating the results, we need a famous result due to Bochner, that characterizes $\psi$. We quote this result from Wendland (2005, Theorem 6.6).



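Theorem 3 (Bochner) A continuous function $\psi : \mathbb{R}^d \to \mathbb{R}$ is positive definite if and only if it is the Fourier transform of a finite nonnegative Borel measure $\Lambda$ on $\mathbb{R}^d$, i.e.,

$$\psi(x) = \int_{\mathbb{R}^d} e^{-i x^T \omega}\, d\Lambda(\omega), \quad x \in \mathbb{R}^d. \tag{15}$$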



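Corollary 4 (Different interpretations of $\gamma_k$) (i) Let $M = \mathbb{R}^d$ and $k(x, y) = \psi(x - y)$, where $\psi : M \to \mathbb{R}$ is a bounded, continuous positive definite function. Then for any $\mathbb{P}, \mathbb{Q} \in \mathscr{P}$,

$$\gamma_k(\mathbb{P}, \mathbb{Q}) = \sqrt{\int_{\mathbb{R}^d} |\phi_{\mathbb{P}}(\omega) - \phi_{\mathbb{Q}}(\omega)|^2\, d\Lambda(\omega)} =: \|\phi_{\mathbb{P}} - \phi_{\mathbb{Q}}\|_{L^2(\mathbb{R}^d, \Lambda)}, \tag{16}$$

where $\phi_{\mathbb{P}}$ and $\phi_{\mathbb{Q}}$ represent the characteristic functions of $\mathbb{P}$ and $\mathbb{Q}$ respectively.

(ii) Suppose $\theta \in L^1(\mathbb{R}^d)$ is a continuous bounded positive definite function and $\int_{\mathbb{R}^d} \theta(x)\, dx = 1$. Let $\psi(x) := \psi_t(x) = t^{-d}\theta(t^{-1}x)$. Assume that $p$ and $q$ are bounded uniformly continuous Radon-Nikodym derivatives of $\mathbb{P}$ and $\mathbb{Q}$ w.r.t. the Lebesgue measure, i.e., $d\mathbb{P} = p\, dx$ and $d\mathbb{Q} = q\, dx$. Then,

$$\lim_{t \to 0} \gamma_k(\mathbb{P}, \mathbb{Q}) = \|p - q\|_{L^2(\mathbb{R}^d)}. \tag{17}$$

In particular, if $|\theta(x)| \le C(1 + \|x\|_2)^{-d-\varepsilon}$ for some $C, \varepsilon > 0$, then (17) holds for all bounded $p$ and $q$ (not necessarily uniformly continuous).

(iii) Suppose $\psi \in L^1(\mathbb{R}^d)$ and $\sqrt{\hat{\psi}} \in L^1(\mathbb{R}^d)$. Then,

$$\gamma_k(\mathbb{P}, \mathbb{Q}) = (2\pi)^{-d/4}\, \|\Phi * \mathbb{P} - \Phi * \mathbb{Q}\|_{L^2(\mathbb{R}^d)}, \tag{18}$$

where $\Phi := \big(\sqrt{\hat{\psi}}\big)^{\vee}$ and $d\Lambda = (2\pi)^{-d/2}\, \hat{\psi}\, d\omega$. Here, $\Phi * \mathbb{P}$ represents the convolution of $\Phi$ and $\mathbb{P}$.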

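Proof (i) Let us consider (14) with $k(x, y) = \psi(x - y)$. Then, we have

\begin{align*}
\gamma_k^2(\mathbb{P}, \mathbb{Q})
&= \iint_{\mathbb{R}^d} \psi(x - y)\, d(\mathbb{P} - \mathbb{Q})(x)\, d(\mathbb{P} - \mathbb{Q})(y) \\
&\stackrel{(a)}{=} \iiint_{\mathbb{R}^d} e^{-i(x - y)^T \omega}\, d\Lambda(\omega)\, d(\mathbb{P} - \mathbb{Q})(x)\, d(\mathbb{P} - \mathbb{Q})(y) \\
&\stackrel{(b)}{=} \int_{\mathbb{R}^d} \left[ \int_{\mathbb{R}^d} e^{-i x^T \omega}\, d(\mathbb{P} - \mathbb{Q})(x) \right] \left[ \int_{\mathbb{R}^d} e^{i y^T \omega}\, d(\mathbb{P} - \mathbb{Q})(y) \right] d\Lambda(\omega) \\
&= \int_{\mathbb{R}^d} \overline{\big(\phi_{\mathbb{P}}(\omega) - \phi_{\mathbb{Q}}(\omega)\big)}\, \big(\phi_{\mathbb{P}}(\omega) - \phi_{\mathbb{Q}}(\omega)\big)\, d\Lambda(\omega) \\
&= \int_{\mathbb{R}^d} |\phi_{\mathbb{P}}(\omega) - \phi_{\mathbb{Q}}(\omega)|^2\, d\Lambda(\omega),
\end{align*}

where Bochner's theorem (Theorem 3) is invoked in (a), while Fubini's theorem (Folland, 1999, Theorem 2.37) is invoked in (b).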
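The equality between the kernel form (13) and the spectral form (16) can be checked numerically in one dimension. The sketch below is illustrative only and is not taken from the paper: it assumes a Gaussian kernel $k(x, y) = \exp(-(x - y)^2 / (2\sigma^2))$, two normal distributions, and hypothetical helper names (`closed_form_gamma_sq`, `spectral_gamma_sq`). Under the sign convention of (15), the spectral measure of this kernel has density $\sigma (2\pi)^{-1/2} e^{-\sigma^2 \omega^2 / 2}$.

```python
import cmath
import math


def closed_form_gamma_sq(m1, s1, m2, s2, sigma):
    """gamma_k^2 via (13) for P = N(m1, s1^2), Q = N(m2, s2^2) and a Gaussian kernel."""

    def expected_k(ma, sa, mb, sb):
        # E k(X, Y) for independent X ~ N(ma, sa^2), Y ~ N(mb, sb^2).
        v = sigma ** 2 + sa ** 2 + sb ** 2
        return sigma / math.sqrt(v) * math.exp(-((ma - mb) ** 2) / (2.0 * v))

    return (expected_k(m1, s1, m1, s1) + expected_k(m2, s2, m2, s2)
            - 2.0 * expected_k(m1, s1, m2, s2))


def spectral_gamma_sq(m1, s1, m2, s2, sigma, half_width=20.0, steps=20001):
    """gamma_k^2 via (16): trapezoidal rule for |phi_P - phi_Q|^2 integrated against Lambda."""

    def phi(m, s, w):
        # Characteristic function of N(m, s^2).
        return cmath.exp(1j * m * w - 0.5 * (s * w) ** 2)

    def lambda_density(w):
        # Density of Lambda for psi(x) = exp(-x^2 / (2 sigma^2)).
        return sigma / math.sqrt(2.0 * math.pi) * math.exp(-0.5 * (sigma * w) ** 2)

    h = 2.0 * half_width / (steps - 1)
    total = 0.0
    for i in range(steps):
        w = -half_width + i * h
        weight = 0.5 if i in (0, steps - 1) else 1.0
        total += weight * abs(phi(m1, s1, w) - phi(m2, s2, w)) ** 2 * lambda_density(w)
    return total * h


if __name__ == "__main__":
    args = (0.0, 1.0, 1.0, 2.0, 1.5)
    print(closed_form_gamma_sq(*args), spectral_gamma_sq(*args))
```

The two printed values should agree up to quadrature error, since both evaluate the same quantity $\gamma_k^2(\mathbb{P}, \mathbb{Q})$.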



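(ii) Consider (13) with $k(x, y) = \psi_t(x - y)$,

\begin{align*}
\gamma_k^2(\mathbb{P}, \mathbb{Q})
&= \iint_{\mathbb{R}^d} \psi_t(x - y)\, p(x) p(y)\, dx\, dy + \iint_{\mathbb{R}^d} \psi_t(x - y)\, q(x) q(y)\, dx\, dy \\
&\qquad - 2 \iint_{\mathbb{R}^d} \psi_t(x - y)\, p(x) q(y)\, dx\, dy \\
&= \int_{\mathbb{R}^d} (\psi_t * p)(x)\, p(x)\, dx + \int_{\mathbb{R}^d} (\psi_t * q)(x)\, q(x)\, dx - 2 \int_{\mathbb{R}^d} (\psi_t * q)(x)\, p(x)\, dx. \tag{19}
\end{align*}

Note that $\lim_{t \to 0} \int_{\mathbb{R}^d} (\psi_t * p)(x)\, p(x)\, dx = \int_{\mathbb{R}^d} \lim_{t \to 0} (\psi_t * p)(x)\, p(x)\, dx$, by invoking the dominated convergence theorem. Since $p$ is bounded and uniformly continuous, by a standard result on approximate identities (see Appendix A), we have $p * \psi_t \to p$ uniformly as $t \to 0$, which means $\lim_{t \to 0} \int_{\mathbb{R}^d} (\psi_t * p)(x)\, p(x)\, dx = \int_{\mathbb{R}^d} p^2(x)\, dx$. The remaining two terms in (19) are handled in the same way. Using this in (19), we have

$$\lim_{t \to 0} \gamma_k^2(\mathbb{P}, \mathbb{Q}) = \int_{\mathbb{R}^d} \big(p^2(x) + q^2(x) - 2 p(x) q(x)\big)\, dx = \|p - q\|_{L^2(\mathbb{R}^d)}^2.$$

Suppose $|\theta(x)| \le C(1 + \|x\|_2)^{-d-\varepsilon}$ for some $C, \varepsilon > 0$. Since $p \in L^1(\mathbb{R}^d)$, by Theorem 26 (see Appendix A), we have $(p * \psi_t)(x) \to p(x)$ as $t \to 0$ for almost every $x$. Therefore $\lim_{t \to 0} \int_{\mathbb{R}^d} (\psi_t * p)(x)\, p(x)\, dx = \int_{\mathbb{R}^d} p^2(x)\, dx$ and the result follows.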



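(iii) Since $\psi$ is positive definite, $\hat{\psi}$ is nonnegative and therefore $\sqrt{\hat{\psi}}$ is valid. Since $\sqrt{\hat{\psi}} \in L^1(\mathbb{R}^d)$, $\Phi$ exists. Define $\phi_{\mathbb{P},\mathbb{Q}} := \phi_{\mathbb{P}} - \phi_{\mathbb{Q}}$. Now, consider

\begin{align*}
\|\Phi * \mathbb{P} - \Phi * \mathbb{Q}\|_{L^2(\mathbb{R}^d)}^2
&= \int_{\mathbb{R}^d} |(\Phi * (\mathbb{P} - \mathbb{Q}))(x)|^2\, dx \\
&= \int_{\mathbb{R}^d} \left| \int_{\mathbb{R}^d} \Phi(x - y)\, d(\mathbb{P} - \mathbb{Q})(y) \right|^2 dx \\
&= \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} \left| \iint_{\mathbb{R}^d} \sqrt{\hat{\psi}(\omega)}\, e^{i(x - y)^T \omega}\, d\omega\, d(\mathbb{P} - \mathbb{Q})(y) \right|^2 dx \\
&\stackrel{(c)}{=} \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} \left| \int_{\mathbb{R}^d} \sqrt{\hat{\psi}(\omega)}\, \big(\overline{\phi_{\mathbb{P}}(\omega)} - \overline{\phi_{\mathbb{Q}}(\omega)}\big)\, e^{i x^T \omega}\, d\omega \right|^2 dx \\
&= \frac{1}{(2\pi)^d} \iiint_{\mathbb{R}^d} \sqrt{\hat{\psi}(\omega)} \sqrt{\hat{\psi}(\xi)}\, \overline{\phi_{\mathbb{P},\mathbb{Q}}(\omega)}\, \phi_{\mathbb{P},\mathbb{Q}}(\xi)\, e^{i(\omega - \xi)^T x}\, d\omega\, d\xi\, dx \\
&\stackrel{(d)}{=} \iint_{\mathbb{R}^d} \sqrt{\hat{\psi}(\omega)} \sqrt{\hat{\psi}(\xi)}\, \overline{\phi_{\mathbb{P},\mathbb{Q}}(\omega)}\, \phi_{\mathbb{P},\mathbb{Q}}(\xi) \left[ \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} e^{i(\omega - \xi)^T x}\, dx \right] d\omega\, d\xi \\
&= \iint_{\mathbb{R}^d} \sqrt{\hat{\psi}(\omega)} \sqrt{\hat{\psi}(\xi)}\, \overline{\phi_{\mathbb{P},\mathbb{Q}}(\omega)}\, \phi_{\mathbb{P},\mathbb{Q}}(\xi)\, \delta(\omega - \xi)\, d\omega\, d\xi \\
&= \int_{\mathbb{R}^d} \hat{\psi}(\omega)\, |\phi_{\mathbb{P}}(\omega) - \phi_{\mathbb{Q}}(\omega)|^2\, d\omega \\
&= (2\pi)^{d/2}\, \gamma_k^2(\mathbb{P}, \mathbb{Q}),
\end{align*}

where (c) and (d) are obtained by invoking Fubini's theorem. ■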

Remark 5 (a) (16) shows that $\gamma_k$ is the $L^2$-distance between the characteristic functions of $\mathbb{P}$ and $\mathbb{Q}$ computed w.r.t. the non-negative finite Borel measure $\Lambda$, which is the Fourier transform of $\psi$. If $\psi \in L^1(\mathbb{R}^d)$, then (16) is a rephrase of the well known fact (Wendland, 2005, Theorem 10.12): for any $f \in \mathcal{H}$,

$$\|f\|_{\mathcal{H}}^2 = (2\pi)^{-d/2} \int_{\mathbb{R}^d} \frac{|\hat{f}(\omega)|^2}{\hat{\psi}(\omega)}\, d\omega. \tag{20}$$

Choosing $f = (\mathbb{P} - \mathbb{Q}) * \psi$ in (20) yields $\hat{f} = \big(\overline{\phi_{\mathbb{P}}} - \overline{\phi_{\mathbb{Q}}}\big)\hat{\psi}$ and therefore the result in (16).

(b) Suppose $d\Lambda(\omega) = (2\pi)^{-d}\, d\omega$. Assume $\mathbb{P}$ and $\mathbb{Q}$ have $p$ and $q$ as Radon-Nikodym derivatives w.r.t. the Lebesgue measure, i.e., $d\mathbb{P} = p\, dx$ and $d\mathbb{Q} = q\, dx$. Using these in (16), it can be shown that $\gamma_k(\mathbb{P}, \mathbb{Q}) = \|p - q\|_{L^2(\mathbb{R}^d)}$. However, this result should be interpreted in a limiting sense as mentioned in Corollary 4(ii), because the choice of $d\Lambda(\omega) = (2\pi)^{-d}\, d\omega$ implies $\psi(x) = \delta(x)$, which does not satisfy the conditions of Corollary 4(i). It can be shown that $\psi(x) = \delta(x)$ is obtained in a limiting sense (Folland, 1999, Proposition 9.1): $\psi_t \to \delta$ in $\mathscr{D}'_d$ as $t \to 0$.






(c) Choosing $\theta(x) = (2\pi)^{-d/2} e^{-\|x\|_2^2/2}$ in Corollary 4(ii) corresponds to $\psi_t$ being a Gaussian kernel (with appropriate normalization such that $\int_{\mathbb{R}^d} \psi_t(x)\, dx = 1$). Therefore, (17) shows that as the bandwidth $t$ of the Gaussian kernel approaches zero, $\gamma_k$ approaches the $L^2$-distance between the densities $p$ and $q$. The same result also holds for choosing $\psi_t$ as the Laplacian kernel, $B_{2n+1}$-spline, inverse multiquadratic, etc. Therefore, $\gamma_k(\mathbb{P}, \mathbb{Q})$ can be seen as a generalization of the $L^2$-distance between probability measures $\mathbb{P}$ and $\mathbb{Q}$.
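The bandwidth limit in (17) can be illustrated in one dimension, where everything is available in closed form. The following sketch is an illustration under stated assumptions, not the authors' code: it takes $\theta$ to be the standard normal density, $\mathbb{P} = N(m_1, s_1^2)$ and $\mathbb{Q} = N(m_2, s_2^2)$, and uses hypothetical function names. For independent $X$ and $Y$, the expectation of $\psi_t(X - Y)$ equals the density of $N(0, t^2 + s_a^2 + s_b^2)$ evaluated at the mean difference.

```python
import math


def normal_pdf(x, var):
    """Density of N(0, var) evaluated at x."""
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def gamma_sq(m1, s1, m2, s2, t):
    """gamma_k^2 from (13) with k(x, y) = psi_t(x - y), psi_t a normalized Gaussian of bandwidth t."""
    return (normal_pdf(0.0, t * t + 2.0 * s1 * s1)
            + normal_pdf(0.0, t * t + 2.0 * s2 * s2)
            - 2.0 * normal_pdf(m1 - m2, t * t + s1 * s1 + s2 * s2))


def l2_distance_sq(m1, s1, m2, s2):
    """Squared L2 distance between the two normal densities."""
    return (normal_pdf(0.0, 2.0 * s1 * s1)
            + normal_pdf(0.0, 2.0 * s2 * s2)
            - 2.0 * normal_pdf(m1 - m2, s1 * s1 + s2 * s2))


if __name__ == "__main__":
    params = (0.0, 1.0, 1.0, 1.5)
    target = l2_distance_sq(*params)
    for t in (1.0, 0.1, 0.01, 0.001):
        print(t, gamma_sq(*params, t), target)
```

As $t$ shrinks, the first printed column of values should approach the fixed target $\|p - q\|_{L^2}^2$.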

(d) The result in (17) holds if $p$ and $q$ are bounded and uniformly continuous. Since any condition on $\mathbb{P}$ and $\mathbb{Q}$ is usually difficult to check in statistical applications, it is better to impose conditions on $\psi$ rather than on $\mathbb{P}$ and $\mathbb{Q}$. In Corollary 4(ii), by imposing additional conditions on $\psi_t$, the result in (17) is shown to hold for all $\mathbb{P}$ and $\mathbb{Q}$ with bounded densities $p$ and $q$. The condition $|\theta(x)| \le C(1 + \|x\|_2)^{-d-\varepsilon}$ for some $C, \varepsilon > 0$ is, e.g., satisfied by the inverse multiquadratic kernel, $\theta(x) = \tilde{C}(1 + \|x\|_2^2)^{-\tau}$, $x \in \mathbb{R}^d$, $\tau > d/2$, where $\tilde{C} = \big(\int_{\mathbb{R}^d} (1 + \|x\|_2^2)^{-\tau}\, dx\big)^{-1}$.

(e) The result in Corollary 4(ii) has connections to kernel density estimation in the $L^2$-sense using Parzen windows (Rosenblatt, 1975), where $\psi$ can be chosen as the Parzen window.

(f) (18) shows that $\gamma_k$ is proportional to the $L^2$-distance between $\Phi * \mathbb{P}$ and $\Phi * \mathbb{Q}$. Let $\Phi$ be such that $\Phi$ is nonnegative and $\Phi \in L^1(\mathbb{R}^d)$. Then, defining $\tilde{\Phi} := \big(\int_{\mathbb{R}^d} \Phi(x)\, dx\big)^{-1} \Phi = \Phi / \sqrt{\hat{\psi}(0)} = \big(\int_{\mathbb{R}^d} \psi(x)\, dx\big)^{-1/2} \Phi$ and using this in (18), we have

$$\gamma_k(\mathbb{P}, \mathbb{Q}) = (2\pi)^{-d/4} \sqrt{\hat{\psi}(0)}\, \big\|\tilde{\Phi} * \mathbb{P} - \tilde{\Phi} * \mathbb{Q}\big\|_{L^2(\mathbb{R}^d)}. \tag{21}$$

The r.h.s. of (21) can be interpreted as follows. Let $X$, $Y$ and $N$ be independent random variables such that $X \sim \mathbb{P}$, $Y \sim \mathbb{Q}$ and $N \sim \tilde{\Phi}$. This means $\gamma_k$ is proportional to the $L^2$-distance computed between the densities associated with the perturbed random variables, $X + N$ and $Y + N$. Note that $\|p - q\|_{L^2(\mathbb{R}^d)}$ is the $L^2$-distance between the densities of $X$ and $Y$. Examples of $\psi$ that satisfy the conditions in Corollary 4(iii), in addition to the conditions on $\Phi$ mentioned here, include the Gaussian and Laplacian kernels on $\mathbb{R}^d$. The result in (18) holds even if $\sqrt{\hat{\psi}} \notin L^1(\mathbb{R}^d)$, as the proof of (iii) can be handled using distribution theory. However, we assumed $\sqrt{\hat{\psi}} \in L^1(\mathbb{R}^d)$ to keep the proof simple, without delving into distribution theory.

Although we will not be using all the results of Corollary 4 in deriving our main results in the following sections, Corollary 4 was presented to provide a better intuitive understanding of $\gamma_k$. To summarize, the core results of this section are Theorem 1 (combined with Proposition 2), which provides a closed form expression for $\gamma_k$ in terms of the measurable and bounded $k$, and Corollary 4(i), which provides an alternative representation for $\gamma_k$ when $k$ is bounded, continuous and translation invariant on $\mathbb{R}^d$.

3. Conditions for Characteristic Kernels 

In this section, we address the question "When is $\gamma_k$ a metric on $\mathscr{P}$?". In other words, "When is $\Pi$ injective?" or "Under what conditions is $k$ characteristic?". To this end, we start with the definition of characteristic kernels and provide some examples where $k$ is such that $\gamma_k$ is not a metric on $\mathscr{P}$. As discussed in the introduction, although some characterizations are available for $k$ so that $\gamma_k$ is a metric on $\mathscr{P}$, they are difficult to check in practice. So, in Section 3.1, we provide the characterization that if $k$ is integrally strictly pd, then $\gamma_k$ is a metric on $\mathscr{P}$. In Section 3.2, we present more easily checkable conditions for translation invariant kernels on $\mathbb{R}^d$, wherein we show that if $\mathrm{supp}(\Lambda) = \mathbb{R}^d$ (see footnote 5 for the definition of the support of a Borel measure), then $\gamma_k$ is a metric on $\mathscr{P}$. This result is extended in a straightforward way to $\mathbb{T}^d$ ($d$-Torus) in Section 3.3. The main results of this section are summarized in Table 1.

Summary of Main Results

Domain            Property                                                      $\mathscr{Q}$      Characteristic   Reference
----------------  ------------------------------------------------------------  -----------------  ---------------  ------------
$M$               $k$ is integrally strictly pd                                 $\mathscr{P}$      Yes              Theorem 7
$\mathbb{R}^d$    $\Omega = \mathbb{R}^d$                                       $\mathscr{P}$      Yes              Theorem 9
$\mathbb{R}^d$    $\mathrm{supp}(\psi)$ is compact                              $\mathscr{P}$      Yes              Corollary 10
$\mathbb{R}^d$    $\Omega \subsetneq \mathbb{R}^d$, $\mathrm{int}(\Omega) \ne \emptyset$   $\mathscr{P}_1$    Yes              Theorem 12
$\mathbb{R}^d$    $\Omega \subsetneq \mathbb{R}^d$                              $\mathscr{P}$      No               Theorem 9
$\mathbb{T}^d$    $A_\psi(0) \ge 0$, $A_\psi(n) > 0$, $\forall n \ne 0$          $\mathscr{P}$      Yes              Theorem 14
$\mathbb{T}^d$    $\exists\, n \ne 0$ such that $A_\psi(n) = 0$                  $\mathscr{P}$      No               Theorem 14

Table 1: The table should be read as: If "Property" is satisfied on "Domain", then $k$ is characteristic (or not) to $\mathscr{Q}$. $\mathscr{P}$ is the set of all Borel probability measures on a topological space $M$. See Section 1.2 for the definition of integrally strictly pd kernels. When $M = \mathbb{R}^d$, $k(x, y) = \psi(x - y)$, where $\psi$ is a bounded, continuous positive definite function on $\mathbb{R}^d$. $\psi$ is the Fourier transform of a finite nonnegative Borel measure $\Lambda$, and $\Omega := \mathrm{supp}(\Lambda)$ (see Theorem 3 and footnote 5 for details). $\mathscr{P}_1 := \{\mathbb{P} \in \mathscr{P} : \phi_{\mathbb{P}} \in L^1(\mathbb{R}^d) \cup L^2(\mathbb{R}^d),\ \mathbb{P} \ll \lambda \text{ and } \mathrm{supp}(\mathbb{P}) \text{ is compact}\}$, where $\phi_{\mathbb{P}}$ is the characteristic function of $\mathbb{P}$ and $\lambda$ is the Lebesgue measure. $\mathbb{P} \ll \lambda$ denotes that $\mathbb{P}$ is absolutely continuous w.r.t. $\lambda$. When $M = \mathbb{T}^d$, $k(x, y) = \psi(x - y)$, where $\psi$ is a bounded, continuous positive definite function on $\mathbb{T}^d$. $\{A_\psi(n)\}_{n=-\infty}^{\infty}$ are the Fourier series coefficients of $\psi$, which are nonnegative and summable (see Theorem 13 for details).

We start by defining characteristic kernels.

Definition 6 (Characteristic kernel) A bounded measurable positive definite kernel $k$ is characteristic to a set $\mathscr{Q} \subset \mathscr{P}$ of probability measures defined on $(M, \mathcal{A})$ if for $\mathbb{P}, \mathbb{Q} \in \mathscr{Q}$, $\gamma_k(\mathbb{P}, \mathbb{Q}) = 0 \Leftrightarrow \mathbb{P} = \mathbb{Q}$. $k$ is simply said to be characteristic if it is characteristic to $\mathscr{P}$. The RKHS $\mathcal{H}$ induced by such a $k$ is called a characteristic RKHS.

As mentioned before, the injectivity of $\Pi$ is related to the characteristic property of $k$. If $k$ is characteristic, then $\gamma_k(\mathbb{P}, \mathbb{Q}) = \|\Pi[\mathbb{P}] - \Pi[\mathbb{Q}]\|_{\mathcal{H}} = 0 \Rightarrow \mathbb{P} = \mathbb{Q}$, which means the map $\mathbb{P} \mapsto \int_M k(\cdot, x)\, d\mathbb{P}(x)$, i.e., $\Pi$, is injective. Therefore, when $M = \mathbb{R}^d$, the embedding of a distribution to a characteristic RKHS can be seen as a generalization of the characteristic function, $\phi_{\mathbb{P}} = \int_{\mathbb{R}^d} e^{i\langle \cdot, x \rangle}\, d\mathbb{P}(x)$. This is because, by the uniqueness theorem for characteristic functions (Dudley, 2002, Theorem 9.5.1), $\phi_{\mathbb{P}} = \phi_{\mathbb{Q}} \Rightarrow \mathbb{P} = \mathbb{Q}$, which means $\mathbb{P} \mapsto \int_{\mathbb{R}^d} e^{i\langle \cdot, x \rangle}\, d\mathbb{P}(x)$ is injective. So, in this context, intuitively $e^{i\langle y, x \rangle}$ can be treated as the characteristic kernel $k$, although, formally, this is not true as $e^{i\langle y, x \rangle}$ is not a pd kernel.

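Before we get to the characterization of characteristic kernels, the following examples show that there exist bounded measurable kernels that are not characteristic.

Example 1 (Trivial kernel) Let $k(x, y) = \psi(x - y) = C$, $\forall\, x, y \in \mathbb{R}^d$ with $C > 0$. Using this in (13), we have $\gamma_k^2(\mathbb{P}, \mathbb{Q}) = C + C - 2C = 0$ for any $\mathbb{P}, \mathbb{Q} \in \mathscr{P}$, which means $k$ is not characteristic.

Example 2 (Dot product kernel) Let $k(x, y) = x^T y$, $x, y \in \mathbb{R}^d$. Using this in (13), we have

$$\gamma_k^2(\mathbb{P}, \mathbb{Q}) = \mu_{\mathbb{P}}^T \mu_{\mathbb{P}} + \mu_{\mathbb{Q}}^T \mu_{\mathbb{Q}} - 2 \mu_{\mathbb{P}}^T \mu_{\mathbb{Q}} = \|\mu_{\mathbb{P}} - \mu_{\mathbb{Q}}\|_2^2,$$

where $\mu_{\mathbb{P}}$ and $\mu_{\mathbb{Q}}$ represent the means associated with $\mathbb{P}$ and $\mathbb{Q}$ respectively, i.e., $\mu_{\mathbb{P}} := \int_{\mathbb{R}^d} x\, d\mathbb{P}(x)$. It is clear that $k$ is not characteristic, as $\gamma_k(\mathbb{P}, \mathbb{Q}) = 0 \Rightarrow \mu_{\mathbb{P}} = \mu_{\mathbb{Q}} \nRightarrow \mathbb{P} = \mathbb{Q}$ for all $\mathbb{P}, \mathbb{Q} \in \mathscr{P}$.

Example 3 (Polynomial kernel of order 2) Let $k(x, y) = (1 + x^T y)^2$, $x, y \in \mathbb{R}^d$. Using this in (14), we have

\begin{align*}
\gamma_k^2(\mathbb{P}, \mathbb{Q})
&= \iint_{\mathbb{R}^d} \big(1 + 2 x^T y + x^T y\, y^T x\big)\, d(\mathbb{P} - \mathbb{Q})(x)\, d(\mathbb{P} - \mathbb{Q})(y) \\
&= 2 \|\mu_{\mathbb{P}} - \mu_{\mathbb{Q}}\|_2^2 + \big\|\Sigma_{\mathbb{P}} - \Sigma_{\mathbb{Q}} + \mu_{\mathbb{P}} \mu_{\mathbb{P}}^T - \mu_{\mathbb{Q}} \mu_{\mathbb{Q}}^T\big\|_F^2,
\end{align*}

where $\Sigma_{\mathbb{P}}$ and $\Sigma_{\mathbb{Q}}$ represent the covariance matrices associated with $\mathbb{P}$ and $\mathbb{Q}$ respectively, i.e., $\Sigma_{\mathbb{P}} := \int_{\mathbb{R}^d} x x^T\, d\mathbb{P}(x) - \mu_{\mathbb{P}} \mu_{\mathbb{P}}^T$. $\|\cdot\|_F$ represents the Frobenius norm. Since $\gamma_k(\mathbb{P}, \mathbb{Q}) = 0 \Rightarrow (\mu_{\mathbb{P}} = \mu_{\mathbb{Q}} \text{ and } \Sigma_{\mathbb{P}} = \Sigma_{\mathbb{Q}}) \nRightarrow \mathbb{P} = \mathbb{Q}$ for all $\mathbb{P}, \mathbb{Q} \in \mathscr{P}$, $k$ is not characteristic.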
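Examples 2 and 3 can be made concrete with two discrete distributions on the real line that share their first two moments. The sketch below is a hypothetical illustration (the distributions, helper names and the unit-bandwidth Gaussian kernel are assumptions, not taken from the paper): it evaluates the double integral (14) exactly for finitely supported measures and shows that the dot product and order-2 polynomial kernels cannot tell the two distributions apart, while a Gaussian kernel can.

```python
import math


def gamma_sq(kernel, p, q):
    """Evaluate (14) for discrete measures given as lists of (point, weight) pairs."""
    # Represent the signed measure P - Q as a single list of (point, signed weight).
    signed = [(x, w) for x, w in p] + [(x, -w) for x, w in q]
    return sum(wi * wj * kernel(xi, xj) for xi, wi in signed for xj, wj in signed)


def linear(x, y):
    return x * y


def poly2(x, y):
    return (1.0 + x * y) ** 2


def gaussian(x, y):
    return math.exp(-0.5 * (x - y) ** 2)


# P is uniform on {-1, 1}; Q puts mass 1/6, 2/3, 1/6 on -sqrt(3), 0, sqrt(3).
# Both have mean 0 and second moment 1, but they are different distributions.
ROOT3 = math.sqrt(3.0)
P = [(-1.0, 0.5), (1.0, 0.5)]
Q = [(-ROOT3, 1.0 / 6.0), (0.0, 2.0 / 3.0), (ROOT3, 1.0 / 6.0)]

if __name__ == "__main__":
    for name, kernel in (("linear", linear), ("poly2", poly2), ("gaussian", gaussian)):
        print(name, gamma_sq(kernel, P, Q))
```

The first two values vanish up to rounding error, whereas the Gaussian value is strictly positive, in line with the Gaussian kernel being characteristic.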

In the following sections, we address the question of when $k$ is characteristic, i.e., for what $k$ is $\gamma_k$ a metric on $\mathscr{P}$?

3.1 Integrally strictly positive definite kernels are characteristic 



Compared to the existing characterizations in literature (Gretton et al., 2007; Fukumizu et al., 2008, 2009a), the following result provides a more natural and easily understandable characterization for characteristic kernels, which shows that integrally strictly pd kernels are characteristic to $\mathscr{P}$.






Theorem 7 (Integrally strictly pd kernels are characteristic) If $k$ is integrally strictly positive definite on a topological space $M$, then $k$ is characteristic to $\mathscr{P}$.

Before proving Theorem 7, we provide a supplementary result in Lemma 8 that gives necessary and sufficient conditions for a kernel not to be characteristic. We show that choosing $k$ to be integrally strictly pd violates the conditions in Lemma 8, and $k$ is therefore characteristic to $\mathscr{P}$.

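Lemma 8 Let $k$ be measurable and bounded on a topological space $M$. Then $\exists\, \mathbb{P} \ne \mathbb{Q}$, $\mathbb{P}, \mathbb{Q} \in \mathscr{P}$ such that $\gamma_k(\mathbb{P}, \mathbb{Q}) = 0$ if and only if there exists a finite non-zero signed Borel measure $\mu$ that satisfies:

(i) $\iint_M k(x, y)\, d\mu(x)\, d\mu(y) = 0$,
(ii) $\mu(M) = 0$.

Proof ($\Leftarrow$) Suppose there exists a finite non-zero signed Borel measure $\mu$ that satisfies (i) and (ii) in Lemma 8. By the Jordan decomposition theorem (Dudley, 2002, Theorem 5.6.1), there exist unique positive measures $\mu^+$ and $\mu^-$ such that $\mu = \mu^+ - \mu^-$ and $\mu^+ \perp \mu^-$ ($\mu^+$ and $\mu^-$ are singular). By (ii), we have $\mu^+(M) = \mu^-(M) =: \alpha$. Define $\mathbb{P} = \alpha^{-1}\mu^+$ and $\mathbb{Q} = \alpha^{-1}\mu^-$. Clearly, $\mathbb{P} \ne \mathbb{Q}$, $\mathbb{P}, \mathbb{Q} \in \mathscr{P}$. Then, by (14), we have

$$\gamma_k^2(\mathbb{P}, \mathbb{Q}) = \iint_M k(x, y)\, d(\mathbb{P} - \mathbb{Q})(x)\, d(\mathbb{P} - \mathbb{Q})(y) = \alpha^{-2} \iint_M k(x, y)\, d\mu(x)\, d\mu(y) \stackrel{(a)}{=} 0,$$

where (a) is obtained by invoking (i). So, we have constructed $\mathbb{P} \ne \mathbb{Q}$ such that $\gamma_k(\mathbb{P}, \mathbb{Q}) = 0$.

($\Rightarrow$) Suppose $\exists\, \mathbb{P} \ne \mathbb{Q}$, $\mathbb{P}, \mathbb{Q} \in \mathscr{P}$ such that $\gamma_k(\mathbb{P}, \mathbb{Q}) = 0$. Let $\mu = \mathbb{P} - \mathbb{Q}$. Clearly $\mu$ is a finite non-zero signed Borel measure that satisfies $\mu(M) = 0$. Note that by (14),

$$\gamma_k^2(\mathbb{P}, \mathbb{Q}) = \iint_M k(x, y)\, d(\mathbb{P} - \mathbb{Q})(x)\, d(\mathbb{P} - \mathbb{Q})(y) = \iint_M k(x, y)\, d\mu(x)\, d\mu(y),$$

and therefore (i) follows. ■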



Proof (of Theorem 7) Since $k$ is integrally strictly pd on $M$, we have

$$\iint_M k(x, y)\, d\eta(x)\, d\eta(y) > 0,$$

for any finite non-zero signed Borel measure $\eta$. This means there does not exist a finite non-zero signed Borel measure that satisfies (i) in Lemma 8. Therefore, by Lemma 8, there does not exist $\mathbb{P} \ne \mathbb{Q}$, $\mathbb{P}, \mathbb{Q} \in \mathscr{P}$ such that $\gamma_k(\mathbb{P}, \mathbb{Q}) = 0$, which implies $k$ is characteristic. ■

Examples of integrally strictly pd kernels on $\mathbb{R}^d$ include the Gaussian, $\exp(-\sigma\|x - y\|_2^2)$, $\sigma > 0$; the Laplacian, $\exp(-\sigma\|x - y\|_1)$, $\sigma > 0$; inverse multiquadratics, $(\sigma^2 + \|x - y\|_2^2)^{-c}$, $c > 0$, $\sigma > 0$, etc., which are translation invariant kernels on $\mathbb{R}^d$. A translation variant integrally strictly pd kernel, $\tilde{k}$, can be obtained from a translation invariant integrally strictly pd kernel, $k$, as $\tilde{k}(x, y) = f(x) k(x, y) f(y)$, where $f : M \to \mathbb{R}$ is a bounded continuous function. A simple example of a translation variant integrally strictly pd kernel on $\mathbb{R}^d$ is $\tilde{k}(x, y) = \exp(\sigma x^T y)$, $\sigma > 0$, where we have chosen $f(\cdot) = \exp(\sigma\|\cdot\|_2^2/2)$ and $k(x, y) = \exp(-\sigma\|x - y\|_2^2/2)$, $\sigma > 0$. Clearly, this kernel is characteristic on compact subsets of $\mathbb{R}^d$. The same result can also be obtained from the fact that $\tilde{k}$ is universal on compact subsets of $\mathbb{R}^d$ (Steinwart, 2001, Section 3, Example 1).
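The factorization behind the translation variant example rests on the identity $\sigma x^T y = \sigma\|x\|_2^2/2 - \sigma\|x - y\|_2^2/2 + \sigma\|y\|_2^2/2$. A minimal numerical check is sketched below; the function names and the choice $\sigma = 0.7$ are illustrative assumptions only.

```python
import math
import random


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def exponential_dot_kernel(x, y, sigma):
    """Translation variant kernel exp(sigma * x^T y)."""
    return math.exp(sigma * dot(x, y))


def factored_kernel(x, y, sigma):
    """f(x) * k(x, y) * f(y) with f = exp(sigma ||.||^2 / 2) and a Gaussian k."""
    diff = [a - b for a, b in zip(x, y)]
    f_x = math.exp(sigma * dot(x, x) / 2.0)
    f_y = math.exp(sigma * dot(y, y) / 2.0)
    gauss = math.exp(-sigma * dot(diff, diff) / 2.0)
    return f_x * gauss * f_y


if __name__ == "__main__":
    rng = random.Random(0)
    x = [rng.uniform(-1.0, 1.0) for _ in range(3)]
    y = [rng.uniform(-1.0, 1.0) for _ in range(3)]
    print(exponential_dot_kernel(x, y, 0.7), factored_kernel(x, y, 0.7))
```

Note that $f$ is bounded only on bounded sets, which is consistent with the statement being made on compact subsets of $\mathbb{R}^d$.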

Although the condition for characteristic $k$ in Theorem 7 is easy to understand compared to other characterizations in literature, it is not always easy to check for integral strict positive definiteness of $k$. In the following section, we assume $M = \mathbb{R}^d$ and $k$ to be translation invariant and present a complete characterization for characteristic $k$ which is simple to check.



3.2 Characterization for translation invariant k on $\mathbb{R}^d$

The complete, detailed proofs of the main results in this section are provided in Section 3.5. Compared to Sriperumbudur et al. (2008), we now present simple proofs for these results without resorting to distribution theory. Let us start with the following assumption.

Assumption 1 $k(x, y) = \psi(x - y)$, where $\psi$ is a bounded continuous real-valued positive definite function on $M = \mathbb{R}^d$.

The following theorem characterizes all translation invariant kernels in $\mathbb{R}^d$ that are characteristic.

Theorem 9 Suppose $k$ satisfies Assumption 1. Then $k$ is characteristic if and only if $\mathrm{supp}(\Lambda) = \mathbb{R}^d$, where $\Lambda$ is defined as in (15).


First, note that the condition $\mathrm{supp}(\Lambda) = \mathbb{R}^d$ is easy to check compared to all other, aforementioned characterizations for characteristic $k$. Table 2 shows some popular translation invariant kernels on $\mathbb{R}$ along with their Fourier spectra, $\hat{\psi}$, and its support: Gaussian, Laplacian, $B_{2n+1}$-spline (Schölkopf and Smola, 2002; see footnote 6) and Sinc kernels are aperiodic, while Poisson (Brémaud, 2001; Steinwart, 2001; Vapnik, 1998), Dirichlet (Brémaud, 2001; Schölkopf and Smola, 2002), Fejér (Brémaud, 2001) and cosine kernels are periodic. Although the Gaussian and Laplacian kernels are shown to be characteristic by all the characterizations we have mentioned so far, the case of $B_{2n+1}$-splines is addressed only by Theorem 9, which shows them to be characteristic (note that $B_{2n+1}$-splines being integrally strictly pd also follows from Theorem 9). In fact, one can provide a more general result on compactly supported translation invariant kernels, which we do later in Corollary 10. The Matérn class of kernels (Rasmussen and Williams, 2006, Section 4.2.1), given by

$$k(x, y) = \psi(x - y) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\, \|x - y\|_2}{\sigma} \right)^{\nu} K_{\nu}\!\left( \frac{\sqrt{2\nu}\, \|x - y\|_2}{\sigma} \right), \quad \nu > 0,\ \sigma > 0, \tag{22}$$



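5. For a finite regular measure $\mu$, there is a largest open set $U$ with $\mu(U) = 0$. The complement of $U$ is called the support of $\mu$, denoted by $\mathrm{supp}(\mu)$.

6. A $B_{2n+1}$-spline is a $B_n$-spline of odd order. Only $B_{2n+1}$-splines are admissible, i.e., $B_n$-splines of odd order are positive definite kernels, whereas the ones of even order have negative components in their Fourier spectrum, $\hat{\psi}$, and, therefore, are not admissible kernels. In Table 2, the symbol $*_1^{(2n+2)}$ represents the $(2n+2)$-fold convolution. An important point to be noted with the $B_{2n+1}$-spline kernel is that its Fourier spectrum, $\hat{\psi}$, has vanishing points at $\omega = 2\pi\alpha$, $\alpha \in \mathbb{Z} \setminus \{0\}$, unlike Gaussian and Laplacian kernels, which do not have any vanishing points in their Fourier spectrum. Nevertheless, the spectrum of all these kernels has support $\mathbb{R}$.

Kernel             $\psi(x)$                                                         $\hat{\psi}(\omega)$                                                                     $\mathrm{supp}(\hat{\psi})$
-----------------  ----------------------------------------------------------------  ---------------------------------------------------------------------------------------  ---------------------------
Gaussian           $\exp(-x^2/(2\sigma^2))$                                          $\sigma \exp(-\sigma^2\omega^2/2)$                                                       $\mathbb{R}$
Laplacian          $\exp(-\sigma|x|)$                                                $\sqrt{2/\pi}\, \sigma/(\sigma^2 + \omega^2)$                                            $\mathbb{R}$
$B_{2n+1}$-spline  $*_1^{(2n+2)} \mathbb{1}_{[-1/2, 1/2]}(x)$                        $\frac{4^{n+1}}{\sqrt{2\pi}} \frac{\sin^{2n+2}(\omega/2)}{\omega^{2n+2}}$                $\mathbb{R}$
Sinc               $\sin(\sigma x)/x$                                                $\sqrt{\pi/2}\, \mathbb{1}_{[-\sigma, \sigma]}(\omega)$                                  $[-\sigma, \sigma]$
Poisson            $\frac{1 - \sigma^2}{\sigma^2 - 2\sigma\cos(x) + 1}$, $0 < \sigma < 1$   $\sqrt{2\pi} \sum_{j=-\infty}^{\infty} \sigma^{|j|} \delta(\omega - j)$           $\mathbb{Z}$
Dirichlet          $\frac{\sin((2n+1)x/2)}{\sin(x/2)}$                               $\sqrt{2\pi} \sum_{j=-n}^{n} \delta(\omega - j)$                                         $\{0, \pm 1, \ldots, \pm n\}$
Fejér              $\frac{1}{n+1} \frac{\sin^2((n+1)x/2)}{\sin^2(x/2)}$              $\sqrt{2\pi} \sum_{j=-n}^{n} \big(1 - \frac{|j|}{n+1}\big) \delta(\omega - j)$           $\{0, \pm 1, \ldots, \pm n\}$
Cosine             $\cos(\sigma x)$                                                  $\sqrt{\pi/2}\, [\delta(\omega - \sigma) + \delta(\omega + \sigma)]$                     $\{-\sigma, \sigma\}$

Table 2: Translation invariant kernels on $\mathbb{R}$ defined by $\psi$, their spectra, $\hat{\psi}$, and its support, $\mathrm{supp}(\hat{\psi})$. The first four are aperiodic kernels while the last four are periodic. The domain is considered to be $\mathbb{R}$ for simplicity. For $x \in \mathbb{R}^d$, the above formulae can be extended by computing $\psi(x) = \prod_{j=1}^{d} \psi(x_j)$, where $x = (x_1, \ldots, x_d)$, and $\hat{\psi}(\omega) = \prod_{j=1}^{d} \hat{\psi}(\omega_j)$, where $\omega = (\omega_1, \ldots, \omega_d)$. $\delta$ represents the Dirac-delta function.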



is characteristic as the Fourier spectrum of $\psi$, given by

$$\hat{\psi}(\omega) = \frac{2^{d+\nu} \pi^{d/2}\, \Gamma(\nu + d/2)\, \nu^{\nu}}{\Gamma(\nu)\, \sigma^{2\nu}} \left( \frac{2\nu}{\sigma^2} + 4\pi^2 \|\omega\|_2^2 \right)^{-(\nu + d/2)}, \quad \omega \in \mathbb{R}^d, \tag{23}$$

is positive for any $\omega \in \mathbb{R}^d$. Here, $\Gamma$ is the Gamma function, $K_\nu$ is the modified Bessel function of the second kind of order $\nu$, where $\nu$ controls the smoothness of $k$. The case of $\nu = \frac{1}{2}$ in the Matérn class gives the exponential kernel, $k(x, y) = \exp(-\|x - y\|_2/\sigma)$, while $\nu \to \infty$ gives the Gaussian kernel. Note that $\hat{\psi}(x - y)$ in (23) is actually the inverse multiquadratic kernel, which is characteristic both by Theorem 7 and Theorem 9.

By Theorem 9, the Sinc kernel in Table 2 is not characteristic, which is not easy to show using other characterizations. By combining Theorem 7 with Theorem 9, it can be shown that the Sinc, Poisson, Dirichlet, Fejér and cosine kernels are not integrally strictly pd. Therefore, for translation invariant kernels on $\mathbb{R}^d$, the integral strict positive definiteness of the kernel (or the lack of it) can be tested using Theorems 7 and 9.
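For a kernel whose spectrum has small support, a pair of distinct distributions at distance zero can be written down explicitly. The sketch below is a hypothetical illustration (distributions, names and $\sigma = 2$ are assumptions): for the cosine kernel $\cos(\sigma(x - y))$ the spectral measure sits on $\{-\sigma, \sigma\}$, so by (16) any two distributions whose characteristic functions agree at $\pm\sigma$ are indistinguishable. Here both characteristic functions vanish at $\pm\sigma$.

```python
import math

SIGMA = 2.0


def gamma_sq(kernel, p, q):
    """Evaluate (14) for discrete measures given as lists of (point, weight) pairs."""
    signed = [(x, w) for x, w in p] + [(x, -w) for x, w in q]
    return sum(wi * wj * kernel(xi, xj) for xi, wi in signed for xj, wj in signed)


def cosine(x, y):
    return math.cos(SIGMA * (x - y))


def gaussian(x, y):
    return math.exp(-0.5 * (x - y) ** 2)


# P is uniform on {0, pi/sigma}; Q is uniform on {pi/(2 sigma), 3 pi/(2 sigma)}.
# Both characteristic functions are zero at omega = sigma and omega = -sigma.
P = [(0.0, 0.5), (math.pi / SIGMA, 0.5)]
Q = [(math.pi / (2.0 * SIGMA), 0.5), (3.0 * math.pi / (2.0 * SIGMA), 0.5)]

if __name__ == "__main__":
    print("cosine  ", gamma_sq(cosine, P, Q))
    print("gaussian", gamma_sq(gaussian, P, Q))
```

The cosine value is zero up to rounding although the two distributions differ, while the Gaussian kernel, whose spectrum has full support, separates them.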

We note that, of all the kernels shown in Table 2, only the Gaussian, Laplacian and $B_{2n+1}$-spline kernels are integrable and their corresponding $\hat{\psi}$ are computed using (5). The other kernels shown in Table 2 are not integrable and their corresponding $\hat{\psi}$ have to be treated as distributions (see Folland (1999, Chapter 9) and Rudin (1991, Chapter 6) for details), except for the Sinc kernel, whose Fourier transform can be computed in the $L^2$ sense (see footnote 7).

Proof (Theorem 9) We provide an outline of the complete proof, which is presented in Section 3.5. The sufficient condition in Theorem 9 is simple to prove and follows from Corollary 4(i), whereas we need a supplementary lemma to prove its necessity, which is presented in Section 3.5. Proving the necessity of Theorem 9 is equivalent to showing that if $\mathrm{supp}(\Lambda) \subsetneq \mathbb{R}^d$, then $\exists\, \mathbb{P} \ne \mathbb{Q}$, $\mathbb{P}, \mathbb{Q} \in \mathscr{P}$ such that $\gamma_k(\mathbb{P}, \mathbb{Q}) = 0$. The supplementary lemma presents equivalent conditions for the existence of $\mathbb{P} \ne \mathbb{Q}$ such that $\gamma_k(\mathbb{P}, \mathbb{Q}) = 0$ if $\mathrm{supp}(\Lambda) \subsetneq \mathbb{R}^d$, using which we prove the necessity of Theorem 9. ■

The whole family of compactly supported translation invariant continuous bounded kernels on $\mathbb{R}^d$ is characteristic, as shown by the following corollary to Theorem 9.

Corollary 10 Suppose $k \ne 0$ satisfies Assumption 1 and $\mathrm{supp}(\psi)$ is compact. Then $k$ is characteristic.

Proof Since $\mathrm{supp}(\psi)$ is compact in $\mathbb{R}^d$, by the Paley-Wiener theorem (Theorem 29 in Appendix A) and Lemma 30 (see Appendix A), we deduce that $\mathrm{supp}(\Lambda) = \mathbb{R}^d$. Therefore, the result follows from Theorem 9. ■

The above result is interesting in practice because of the computational advantage in dealing with compactly supported kernels. Note that proving such a general result for compactly supported kernels on $\mathbb{R}^d$ is not straightforward (maybe not even possible) with the other characterizations.

As a corollary to Theorem 9, the following result provides a method to construct new characteristic kernels from a given one.

Corollary 11 Let $k$, $k_1$ and $k_2$ satisfy Assumption 1. Suppose $k$ is characteristic and $k_2 \ne 0$. Then $k + k_1$ and $k \cdot k_2$ are characteristic.

Proof Since $k$, $k_1$ and $k_2$ satisfy Assumption 1, $k + k_1$ and $k_2 \cdot k$ also satisfy Assumption 1. In addition,

\begin{align*}
(k + k_1)(x, y) &:= k(x, y) + k_1(x, y) = \psi(x - y) + \psi_1(x - y) = \int_{\mathbb{R}^d} e^{-i(x - y)^T \omega}\, d(\Lambda + \Lambda_1)(\omega), \\
(k \cdot k_2)(x, y) &:= k(x, y)\, k_2(x, y) = \psi(x - y)\, \psi_2(x - y) = \iint_{\mathbb{R}^d} e^{-i(x - y)^T (\omega + \xi)}\, d\Lambda(\omega)\, d\Lambda_2(\xi) \\
&\stackrel{(a)}{=} \int_{\mathbb{R}^d} e^{-i(x - y)^T \omega}\, d(\Lambda * \Lambda_2)(\omega),
\end{align*}

where (a) follows from the definition of convolution of measures (see Rudin (1991, Section 9.14) for details). Since $k$ is characteristic, i.e., $\mathrm{supp}(\Lambda) = \mathbb{R}^d$, and $\mathrm{supp}(\Lambda) \subset \mathrm{supp}(\Lambda + \Lambda_1)$, we have $\mathrm{supp}(\Lambda + \Lambda_1) = \mathbb{R}^d$ and therefore $k + k_1$ is characteristic. Similarly, since $\mathrm{supp}(\Lambda) \subset \mathrm{supp}(\Lambda * \Lambda_2)$, we have $\mathrm{supp}(\Lambda * \Lambda_2) = \mathbb{R}^d$ and therefore, $k \cdot k_2$ is characteristic. ■

7. If $f \in L^2(\mathbb{R}^d)$, the Fourier transform $\mathcal{F}[f] := \hat{f}$ of $f$ is defined to be the limit, in the $L^2$-norm, of the sequence $\{\hat{f}_n\}$ of Fourier transforms of any sequence $\{f_n\}$ of functions belonging to $\mathscr{S}_d$, such that $f_n$ converges in the $L^2$-norm to the given function $f \in L^2(\mathbb{R}^d)$, as $n \to \infty$. The function $\hat{f}$ is defined almost everywhere on $\mathbb{R}^d$ and belongs to $L^2(\mathbb{R}^d)$. Thus, $\mathcal{F}$ is a linear operator, mapping $L^2(\mathbb{R}^d)$ into $L^2(\mathbb{R}^d)$. See Gasquet and Witomski (1999, Chapter IV, Lesson 22) for details.

Note that in the above result, we do not need $k_1$ or $k_2$ to be characteristic. Therefore, one can generate all sorts of kernels that are characteristic by starting with a characteristic kernel, $k$.

So far, we have considered characterizations for $k$ such that it is characteristic to $\mathscr{P}$. We showed in Theorem 9 that kernels with $\mathrm{supp}(\Lambda) \subsetneq \mathbb{R}^d$ are not characteristic to $\mathscr{P}$. Now, we can question whether such kernels can be characteristic to some proper subset $\mathscr{Q}$ of $\mathscr{P}$. The following result addresses this. Note that these kernels, i.e., the kernels with $\mathrm{supp}(\Lambda) \subsetneq \mathbb{R}^d$, are usually not useful in practice, especially in statistical inference applications, because the conditions on $\mathscr{Q}$ are usually not easy to check. On the other hand, the following result is of theoretical interest: along with Theorem 9, it completes the characterization of characteristic kernels that are translation invariant on $\mathbb{R}^d$. Before we state the result, we denote $\mathbb{P} \ll \mathbb{Q}$ to mean that $\mathbb{P}$ is absolutely continuous w.r.t. $\mathbb{Q}$.

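Theorem 12 Let $\mathscr{P}_1 := \{\mathbb{P} \in \mathscr{P} : \phi_{\mathbb{P}} \in L^1(\mathbb{R}^d) \cup L^2(\mathbb{R}^d),\ \mathbb{P} \ll \lambda \text{ and } \mathrm{supp}(\mathbb{P}) \text{ is compact}\}$, where $\lambda$ is the Lebesgue measure. Suppose $k$ satisfies Assumption 1 and $\mathrm{supp}(\Lambda) \subsetneq \mathbb{R}^d$ has a non-empty interior, where $\Lambda$ is defined as in (15). Then $k$ is characteristic to $\mathscr{P}_1$.

Proof See Section 3.5. ■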

Although, by Theorem 9, the kernels with $\mathrm{supp}(\Lambda) \subsetneq \mathbb{R}^d$ are not characteristic to $\mathscr{P}$, Theorem 12 shows that there exists a subset of $\mathscr{P}$ to which a subset of these kernels are characteristic. This type of result is not available for the previously mentioned characterizations. An example of a kernel that satisfies the conditions in Theorem 12 is the Sinc kernel, $\psi(x) = \frac{\sin(\sigma x)}{x}$, which has $\mathrm{supp}(\Lambda) = [-\sigma, \sigma]$. The condition that $\mathrm{supp}(\Lambda) \subsetneq \mathbb{R}^d$ has a non-empty interior is important for Theorem 12 to hold. If $\mathrm{supp}(\Lambda)$ has an empty interior (examples include periodic kernels), then one can construct $\mathbb{P} \ne \mathbb{Q}$, $\mathbb{P}, \mathbb{Q} \in \mathscr{P}_1$ such that $\gamma_k(\mathbb{P}, \mathbb{Q}) = 0$. This is illustrated in Example 5, which is deferred to Section 3.5.

So far, we have characterized the characteristic property of kernels that satisfy (a) $\mathrm{supp}(\Lambda) = \mathbb{R}^d$ or (b) $\mathrm{supp}(\Lambda) \subsetneq \mathbb{R}^d$ with $\mathrm{int}(\mathrm{supp}(\Lambda)) \ne \emptyset$. In the following section, we investigate kernels that have $\mathrm{supp}(\Lambda) \subsetneq \mathbb{R}^d$ with $\mathrm{int}(\mathrm{supp}(\Lambda)) = \emptyset$, examples of which include periodic kernels on $\mathbb{R}^d$. This discussion uses the fact that a periodic function on $\mathbb{R}^d$ can be treated as a function on $\mathbb{T}^d$, the $d$-Torus.

3.3 Characterization for translation invariant k on $\mathbb{T}^d$

Let $M = \times_{j=1}^{d} [0, \tau_j)$ and $\tau := (\tau_1, \ldots, \tau_d)$. A function defined on $M$ with periodic boundary conditions is equivalent to considering a periodic function on $\mathbb{R}^d$ with period $\tau$. With no loss of generality, we can choose $\tau_j = 2\pi$, $\forall j$, which yields $M = [0, 2\pi)^d =: \mathbb{T}^d$, called the $d$-Torus. The results presented here hold for any $0 < \tau_j < \infty$, $\forall j$, but we choose $\tau_j = 2\pi$ for simplicity. Similar to Assumption 1, we now make the following assumption.

Assumption 2 $k(x, y) = \psi((x - y)_{\mathrm{mod}\ 2\pi})$, where $\psi$ is a continuous real-valued positive definite function on $M = \mathbb{T}^d$.

Similar to Theorem 3, we now state Bochner's theorem on $M = \mathbb{T}^d$.

Theorem 13 (Bochner) A continuous function $\psi : \mathbb{T}^d \to \mathbb{R}$ is positive definite if and only if

$$\psi(x) = \sum_{n \in \mathbb{Z}^d} A_\psi(n)\, e^{i x^T n}, \quad x \in \mathbb{T}^d, \tag{24}$$

where $A_\psi : \mathbb{Z}^d \to \mathbb{R}_+$, $A_\psi(-n) = A_\psi(n)$ and $\sum_{n \in \mathbb{Z}^d} A_\psi(n) < \infty$. $A_\psi$ are called the Fourier series coefficients of $\psi$.

Examples for $\psi$ include the Poisson, Dirichlet, Fejér and cosine kernels, which are shown in Table 2. We now state the result that defines characteristic kernels on $\mathbb{T}^d$.

Theorem 14 Suppose $k$ satisfies Assumption 2. Then $k$ is characteristic (to the set of all Borel probability measures on $\mathbb{T}^d$) if and only if $A_\psi(0) \ge 0$, $A_\psi(n) > 0$, $\forall n \ne 0$.

The proof is provided in Section 3.5 and the idea is similar to that of Theorem 9. Based on the above result, one can generate characteristic kernels by constructing an infinite sequence of positive numbers that are summable and then using them in (24). It can be seen from Table 2 that the Poisson kernel on $\mathbb{T}$ is characteristic while the Dirichlet, Fejér and cosine kernels are not. Some examples of characteristic kernels on $\mathbb{T}$ are:

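(1) $k(x, y) = e^{\alpha\cos(x - y)} \cos(\alpha \sin(x - y))$, $0 < \alpha \le 1$ $\leftrightarrow$ $A_\psi(0) = 1$, $A_\psi(n) = \frac{\alpha^{|n|}}{2\,|n|!}$, $\forall n \ne 0$.

(2) $k(x, y) = -\log(1 - 2\alpha\cos(x - y) + \alpha^2)$, $|\alpha| < 1$ $\leftrightarrow$ $A_\psi(0) = 0$, $A_\psi(n) = \frac{\alpha^{|n|}}{|n|}$, $\forall n \ne 0$.

(3) $k(x, y) = (\pi - (x - y)_{\mathrm{mod}\ 2\pi})^2$ $\leftrightarrow$ $A_\psi(0) = \frac{\pi^2}{3}$, $A_\psi(n) = \frac{2}{n^2}$, $\forall n \ne 0$.

(4) $k(x, y) = \frac{\sinh \alpha}{\cosh \alpha - \cos(x - y)}$, $\alpha > 0$ $\leftrightarrow$ $A_\psi(0) = 1$, $A_\psi(n) = e^{-\alpha|n|}$, $\forall n \ne 0$.

(5) $k(x, y) = \frac{\pi \cosh(\alpha(\pi - (x - y)_{\mathrm{mod}\ 2\pi}))}{\alpha \sinh(\pi\alpha)}$ $\leftrightarrow$ $A_\psi(0) = \frac{1}{\alpha^2}$, $A_\psi(n) = \frac{1}{n^2 + \alpha^2}$, $\forall n \ne 0$.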
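The recipe of building a characteristic kernel on $\mathbb{T}$ from a summable sequence of positive coefficients via (24) can be sketched in a few lines. The code below is an illustrative assumption-laden sketch (function names and the truncation level are hypothetical): it sums the series with $A_\psi(0) = 1$ and $A_\psi(n) = e^{-\alpha|n|}$, and compares the result with the closed form $\sinh\alpha / (\cosh\alpha - \cos\theta)$ of example (4).

```python
import math


def kernel_from_coefficients(theta, coefficient, n_max):
    """Truncated series (24) on the 1-torus, using A(-n) = A(n) to pair the terms."""
    total = coefficient(0)
    for n in range(1, n_max + 1):
        total += 2.0 * coefficient(n) * math.cos(n * theta)
    return total


def closed_form(theta, alpha):
    """Closed form of example (4): sinh(alpha) / (cosh(alpha) - cos(theta))."""
    return math.sinh(alpha) / (math.cosh(alpha) - math.cos(theta))


if __name__ == "__main__":
    alpha = 0.5
    for theta in (0.0, 0.7, math.pi, 5.0):
        series = kernel_from_coefficients(theta, lambda n: math.exp(-alpha * abs(n)), 200)
        print(theta, series, closed_form(theta, alpha))
```

Since every coefficient is strictly positive, the condition of Theorem 14 holds for this kernel; replacing the coefficient function by any other positive summable sequence yields another candidate kernel, under the same assumptions.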

The following result relates characteristic kernels and universal kernels defined on $\mathbb{T}^d$.

Corollary 15 Let $k$ be a characteristic kernel satisfying Assumption 2 with $A_\psi(0) > 0$. Then $k$ is also universal.

Proof Since $k$ is characteristic with $A_\psi(0) > 0$, we have $A_\psi(n) > 0$, $\forall n$. Therefore, by Corollary 11 of Steinwart (2001), $k$ is universal. ■

Since $k$ being universal implies that it is characteristic, the above result shows that the converse is not true (though almost true, except that $A_\psi(0)$ can be zero for characteristic kernels). The condition on $A_\psi$ in Theorem 14, i.e., $A_\psi(0) \ge 0$, $A_\psi(n) > 0$, $\forall n \ne 0$, can be equivalently written as $\mathrm{supp}(A_\psi) = \mathbb{Z}^d$. Therefore, Theorems 9 and 14 are of similar flavor.



In fact, these results can be generalized to locally compact Abelian groups. Fukumizu et al. (2009b) show that a bounded continuous translation invariant kernel on a locally compact Abelian group $G$ is characteristic to the set of all probability measures on $G$ if and only if the support of the Fourier transform of the translation invariant kernel is the dual group of $G$. In our case, $(\mathbb{R}^d, +)$ and $(\mathbb{T}^d, +)$ are locally compact Abelian groups with $(\mathbb{R}^d, +)$ and $(\mathbb{Z}^d, +)$ as their respective dual groups. In Fukumizu et al. (2009b), these results are also extended to translation invariant kernels on non-Abelian compact groups and on a semigroup.

[Figure 1: diagram with nodes "integrally strictly pd", "characteristic", "strictly pd", "universal kernel" and "$\mathcal{H}$ dense in $L^r(M, \mathbb{P})$, $\forall\, \mathbb{P} \in \mathscr{P}$ and some $r \in [1, \infty)$", connected by implications labelled T. 7, F. 4, C. 15 and T. 9.]

Figure 1: Summary of the relationship between various characterizations is shown along with the reference. The letters "C", "F", and "T" refer to Corollary, Footnote and Theorem respectively. For example, T. 7 refers to Theorem 7. The implications which are open problems are shown with "?". The inclusion symbol drawn between $A$ and $B$ indicates that $A$ is a dense subset of $B$. Refer to Section 3.4 for details.



3.4 Relation between various characterizations of characteristic kernels 

So far, we have presented various characterizations of characteristic ke rnels, which are easily 



checkable compared to the ch aracterizations proposed in literature (jGretton et al.l . 12007 



Fukumizu et~atl . 120081 . l2009bh . Now, it is of interest to understand the relation between 



these characterizations. A summary of the relationship between these characterizations is 
shown in Figure [H which is discussed below. 

Characteristic kernels vs. Integrally strictly pd kernels: It is clear from Theorem 7 that integrally strictly pd kernels on a topological space $M$ are characteristic, while it is not clear whether the converse is true. However, when $k$ is translation invariant on $\mathbb{R}^d$, the converse holds. This is because if $k$ is characteristic, then by Theorem 9, $\mathrm{supp}(\Lambda) = \mathbb{R}^d$, where $\Lambda$ is defined as in (15). It is easy to check that if $\mathrm{supp}(\Lambda) = \mathbb{R}^d$, then $k$ is integrally strictly pd.

Integrally strictly pd kernels vs. Strictly pd kernels: The relation between integrally strictly pd and strictly pd kernels shown in Figure 1 is straightforward, as one direction follows from footnote 4 while the other direction is not true, which follows from Steinwart and Christmann (2008, Proposition 4.60, Theorem 4.62). However, if $M$ is a finite set, then $k$ being strictly pd also implies it is integrally strictly pd.

Characteristic kernels vs. Strictly pd kernels: Since integrally strictly pd kernels are characteristic and are also strictly pd, a natural question to ask is, "What is the relation between characteristic and strictly pd kernels?" It can be seen that strictly pd kernels need not be characteristic because the sinc-squared kernel, $k(x, y) = \frac{\sin^2(\sigma(x-y))}{(x-y)^2}$ on $\mathbb{R}$, which has $\mathrm{supp}(\Lambda) = [-\sigma, \sigma] \subsetneq \mathbb{R}$, is strictly pd (Wendland, 2005, Theorem 6.11), while it is not characteristic by Theorem 9. However, for any general $M$, it is not clear whether $k$ being characteristic implies that it is strictly pd. As a special case, if $M = \mathbb{R}^d$ or $M = \mathbb{T}^d$, then by Theorems 9 and 14, it follows that a translation invariant $k$ being characteristic also implies that it is strictly pd.

Universal kernels vs. Characteristic kernels: Gretton et al. (2007) have shown that if $k$ is universal in the sense of Steinwart (2001), then it is characteristic. As mentioned in Section 3.3, the converse is not true, i.e., if a kernel is characteristic, then it need not be universal, which follows from Corollary 15. Note that in this case, $M$ is assumed to be a compact metric space. The notion of universality of kernels was extended to non-compact domains by Micchelli et al. (2006): $k$ is said to be universal on a non-compact Hausdorff space $M$ if, for any compact $Z \subset M$, the set $K(Z) := \mathrm{span}\{k(\cdot, y) : y \in Z\}$ is dense in $C(Z)$ w.r.t. the supremum norm, where $C(Z)$ represents the space of continuous functions defined on $Z$. Note that when $M$ is compact, this notion of universality is the same as that of Steinwart (2001). Micchelli et al. (2006, Proposition 15) have provided a characterization of universality for translation invariant kernels on $\mathbb{R}^d$: $k$ is universal if $\lambda(\mathrm{supp}(\Lambda)) > 0$, where $\lambda$ is the Lebesgue measure and $\Lambda$ is defined as in (15). This means that if a translation invariant kernel on $\mathbb{R}^d$ is characteristic, then it is also universal in the sense of Micchelli et al. (2006), while the converse is not true. However, the relation between these notions for a general non-compact Hausdorff space $M$ is not clear.

Fukumizu et al. (2008, 2009b) have shown that $k$ is characteristic if and only if $\mathcal{H} + \mathbb{R}$ is dense in $L^r(M, \mathbb{P})$ for all $\mathbb{P} \in \mathscr{P}$ and for some $r \in [1, \infty)$. Since $\mathcal{H} \subset \mathcal{H} + \mathbb{R}$, this means that if $\mathcal{H}$ is dense in $L^r(M, \mathbb{P})$ for all $\mathbb{P} \in \mathscr{P}$ and for some $r \in [1, \infty)$, then so is $\mathcal{H} + \mathbb{R}$, and $k$ is therefore characteristic. Clearly, the converse is not true in general. However, if constant functions are included in $\mathcal{H}$, then it is easy to see that the converse is also true.

Universal kernels vs. Strictly pd kernels: If a kernel is universal, then it is strictly pd, which follows from Steinwart and Christmann (2008, Definition 4.53, Proposition 4.54, Exercise 4.11). On the other hand, if a kernel is strictly pd, then it need not be universal, which follows from the results due to Dahmen and Micchelli (1987) and Pinkus (2004) for Taylor kernels (Steinwart and Christmann, 2008, Lemma 4.8, Corollary 4.57). Refer to Steinwart and Christmann (2008, Section 4.7, p. 161) for more details.



Recently, in Sriperumbudur et al. (2010a,b), we carried out a thorough study relating characteristic kernels to various notions of universality, wherein we addressed some open questions mentioned in the above discussion and Figure 1. This is done by relating universality to the injective embedding of regular Borel measures into an RKHS, which can therefore be seen as a generalization of the notion of characteristic kernels, as the latter deal with the injective RKHS embedding of probability measures.



3.5 Proofs 

First, we present a supplementary result in Lemma 16 that will be used to prove Theorem 9. The idea of Lemma 16 is to characterize the equivalent conditions for the existence of $\mathbb{P} \neq \mathbb{Q}$ such that $\gamma_k(\mathbb{P}, \mathbb{Q}) = 0$ when $\mathrm{supp}(\Lambda) \subsetneq \mathbb{R}^d$. Its proof relies on the properties of characteristic functions, which we have collected in Theorem 27 in Appendix A.



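Lemma 16 Let $\mathscr{P}_0 := \{\mathbb{P} \in \mathscr{P} : \phi_{\mathbb{P}} \in L^1(\mathbb{R}^d) \cup L^2(\mathbb{R}^d) \text{ and } \mathbb{P} \ll \lambda\}$, where $\lambda$ is the Lebesgue measure. Suppose $k$ satisfies Assumption 1 and $\mathrm{supp}(\Lambda) \subsetneq \mathbb{R}^d$, where $\Lambda$ is defined as in (12). Then, for any $\mathbb{Q} \in \mathscr{P}_0$, $\exists\, \mathbb{P} \neq \mathbb{Q}$, $\mathbb{P} \in \mathscr{P}_0$ such that $\gamma_k(\mathbb{P}, \mathbb{Q}) = 0$ if and only if there exists a non-zero function $\theta : \mathbb{R}^d \to \mathbb{C}$ that satisfies the following conditions:

(i) $\theta \in (L^1(\mathbb{R}^d) \cup L^2(\mathbb{R}^d)) \cap C_b(\mathbb{R}^d)$ is conjugate symmetric,$^{8}$ i.e., $\theta(x) = \overline{\theta(-x)}$, $\forall x \in \mathbb{R}^d$,

(ii) $\theta^{\vee} \in L^1(\mathbb{R}^d) \cap (L^2(\mathbb{R}^d) \cup C_b(\mathbb{R}^d))$,

(iii) $\int_{\mathbb{R}^d} |\theta(x)|^2 \, d\Lambda(x) = 0$,

(iv) $\theta(0) = 0$,

(v) $\inf_{x \in \mathbb{R}^d} \{\theta^{\vee}(x) + q(x)\} \geq 0$.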

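Proof Define $L^1 := L^1(\mathbb{R}^d)$, $L^2 := L^2(\mathbb{R}^d)$ and $C_b := C_b(\mathbb{R}^d)$.

($\Leftarrow$) Suppose there exists a non-zero function $\theta$ satisfying (i)-(v). For any $\mathbb{Q} \in \mathscr{P}_0$, we have $\phi_{\mathbb{Q}} \in (L^1 \cup L^2) \cap C_b$. When $\phi_{\mathbb{Q}} \in L^1 \cap C_b$, the Riemann-Lebesgue lemma (Lemma 28 in Appendix A) implies that $q = [\phi_{\mathbb{Q}}]^{\vee} \in L^1 \cap C_b$, where $q$ is the Radon-Nikodym derivative of $\mathbb{Q}$ w.r.t. $\lambda$. When $\phi_{\mathbb{Q}} \in L^2 \cap C_b$, the Fourier transform in the $L^2$ sense (see footnote 7) implies that $q = [\phi_{\mathbb{Q}}]^{\vee} \in L^1 \cap L^2$. Therefore, $q \in L^1 \cap (L^2 \cup C_b)$. Define $p := q + \theta^{\vee}$. Clearly $p \in L^1 \cap (L^2 \cup C_b)$. In addition, $\phi_{\mathbb{P}} = \hat{p} = \hat{q} + \widehat{\theta^{\vee}} = \phi_{\mathbb{Q}} + \theta \in (L^1 \cup L^2) \cap C_b$. Since $\theta$ is conjugate symmetric, $\theta^{\vee}$ is real valued and so is $p$. Consider

$$\int_{\mathbb{R}^d} p(x)\,dx = \int_{\mathbb{R}^d} q(x)\,dx + \int_{\mathbb{R}^d} \theta^{\vee}(x)\,dx = 1 + \theta(0) = 1.$$

(v) implies that $p$ is non-negative. Therefore, $p$ is the Radon-Nikodym derivative of a probability measure $\mathbb{P}$ w.r.t. $\lambda$, where $\mathbb{P}$ is such that $\mathbb{P} \neq \mathbb{Q}$ and $\mathbb{P} \in \mathscr{P}_0$. By (16), we have

$$\gamma_k^2(\mathbb{P}, \mathbb{Q}) = \int_{\mathbb{R}^d} |\phi_{\mathbb{P}}(x) - \phi_{\mathbb{Q}}(x)|^2 \, d\Lambda(x) = \int_{\mathbb{R}^d} |\theta(x)|^2 \, d\Lambda(x) = 0.$$

($\Rightarrow$) Suppose that there exists $\mathbb{P} \neq \mathbb{Q}$, $\mathbb{P}, \mathbb{Q} \in \mathscr{P}_0$ such that $\gamma_k(\mathbb{P}, \mathbb{Q}) = 0$. Define $\theta := \phi_{\mathbb{P}} - \phi_{\mathbb{Q}}$. We need to show that $\theta$ satisfies (i)-(v). $\mathbb{P}, \mathbb{Q} \in \mathscr{P}_0$ implies $\phi_{\mathbb{P}}, \phi_{\mathbb{Q}} \in (L^1 \cup L^2) \cap C_b$ and $p, q \in L^1 \cap (L^2 \cup C_b)$. Therefore, $\theta = \phi_{\mathbb{P}} - \phi_{\mathbb{Q}} \in (L^1 \cup L^2) \cap C_b$ and $\theta^{\vee} = p - q \in L^1 \cap (L^2 \cup C_b)$. By Theorem 27 (see Appendix A), $\phi_{\mathbb{P}}$ and $\phi_{\mathbb{Q}}$ are conjugate symmetric and so is $\theta$. Therefore $\theta$ satisfies (i) and $\theta^{\vee}$ satisfies (ii). $\theta$ satisfies (iv) as

$$\theta(0) = \int_{\mathbb{R}^d} \theta^{\vee}(x)\,dx = \int_{\mathbb{R}^d} (p(x) - q(x))\,dx = 0.$$

Non-negativity of $p$ yields (v). By (16), $\gamma_k(\mathbb{P}, \mathbb{Q}) = 0$ implies (iii). ■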

Remark 17 Note that the dependence of $\theta$ on the kernel appears in the form of (iii) in Lemma 16. This condition shows that $\lambda(\mathrm{supp}(\theta) \cap \mathrm{supp}(\Lambda)) = 0$, i.e., the supports of $\theta$ and $\Lambda$ are disjoint w.r.t. the Lebesgue measure $\lambda$. In other words, $\mathrm{supp}(\theta) \subset \mathrm{cl}(\mathbb{R}^d \setminus \mathrm{supp}(\Lambda))$. So, the idea is to introduce the perturbation $\theta$ over an open set $U$ where $\Lambda(U) = 0$. The remaining conditions characterize the nature of this perturbation so that the constructed measure, $p = q + \theta^{\vee}$, is a valid probability measure. Conditions (i), (ii) and (iv) simply follow from $\theta = \phi_{\mathbb{P}} - \phi_{\mathbb{Q}}$, while (v) ensures that $p(x) \geq 0$, $\forall x$.

8. Note that $\mathrm{Re}[\theta]$ and $\mathrm{Im}[\theta]$ are even and odd functions on $\mathbb{R}^d$, respectively.

Using Lemma 16, we now present the proof of Theorem 9.

Proof (Theorem 9) The sufficiency follows from (16): if $\mathrm{supp}(\Lambda) = \mathbb{R}^d$, then $\gamma_k^2(\mathbb{P}, \mathbb{Q}) = \int_{\mathbb{R}^d} |\phi_{\mathbb{P}}(x) - \phi_{\mathbb{Q}}(x)|^2 \, d\Lambda(x) = 0 \Rightarrow \phi_{\mathbb{P}} = \phi_{\mathbb{Q}}$, a.e., which implies $\mathbb{P} = \mathbb{Q}$ and therefore $k$ is characteristic. To prove necessity, we need to show that if $\mathrm{supp}(\Lambda) \subsetneq \mathbb{R}^d$, then there exists $\mathbb{P} \neq \mathbb{Q}$, $\mathbb{P}, \mathbb{Q} \in \mathscr{P}$ such that $\gamma_k(\mathbb{P}, \mathbb{Q}) = 0$. By Lemma 16, this is equivalent to showing that there exists a non-zero $\theta$ satisfying the conditions in Lemma 16. Below, we provide a constructive procedure for such a $\theta$ when $\mathrm{supp}(\Lambda) \subsetneq \mathbb{R}^d$, thereby proving the result.
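Consider the following function, $f_{\beta,\omega_0} \in C^{\infty}(\mathbb{R}^d)$ supported in $[\omega_0 - \beta, \omega_0 + \beta]$,

$$f_{\beta,\omega_0}(\omega) = \prod_{j=1}^{d} h_{\beta_j,\omega_{0,j}}(\omega_j) \quad \text{with} \quad h_{a,b}(y) := \mathbb{1}_{[-a,a]}(y - b)\, e^{-\frac{a^2}{a^2 - (y-b)^2}}, \tag{25}$$

where $\omega = (\omega_1, \ldots, \omega_d)$, $\omega_0 = (\omega_{0,1}, \ldots, \omega_{0,d})$, $\beta = (\beta_1, \ldots, \beta_d)$, $a \in \mathbb{R}_{++}$, $b \in \mathbb{R}$ and $y \in \mathbb{R}$.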

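Since $\mathrm{supp}(\Lambda) \subsetneq \mathbb{R}^d$, there exists an open set $U \subset \mathbb{R}^d$ such that $\Lambda(U) = 0$. So, there exist $\beta \in \mathbb{R}^d_{++}$ and $\omega_0 > \beta$ (element-wise inequality) such that $[\omega_0 - \beta, \omega_0 + \beta] \subset U$. Let

$$\theta = \alpha\,(f_{\beta,\omega_0} + f_{\beta,-\omega_0}), \quad \alpha \in \mathbb{R} \setminus \{0\}, \tag{26}$$

which implies $\mathrm{supp}(\theta) = [-\omega_0 - \beta, -\omega_0 + \beta] \cup [\omega_0 - \beta, \omega_0 + \beta]$ is compact. Clearly $\theta \in \mathscr{D}_d \subset \mathscr{S}_d$, which implies $\theta^{\vee} \in \mathscr{S}_d \subset L^1(\mathbb{R}^d) \cap L^2(\mathbb{R}^d)$. Therefore, by construction, $\theta$ satisfies (i)-(iv) in Lemma 16. Since $\int_{\mathbb{R}^d} \theta^{\vee}(x)\,dx = \theta(0) = 0$ (by construction), $\theta^{\vee}$ will take negative values, so we need to show that there exists $\mathbb{Q} \in \mathscr{P}_0$ such that (v) in Lemma 16 holds. Let $\mathbb{Q}$ be such that it has a density given by

$$q(x) = C_l \prod_{j=1}^{d} \frac{1}{(1 + |x_j|^2)^l}, \quad l \in \mathbb{N}, \quad \text{where} \quad C_l = \prod_{j=1}^{d} \left( \int_{\mathbb{R}} (1 + |x_j|^2)^{-l} \, dx_j \right)^{-1}, \tag{27}$$

and $x = (x_1, \ldots, x_d)$. Since $\theta^{\vee}(x) = 2\alpha \cos(\omega_0^T x) \prod_{j=1}^{d} h^{\vee}_{\beta_j,0}(x_j)$, it can be verified that choosing $\alpha$ such that

$$0 < |\alpha| \leq \frac{C_l}{2 \sup_x \left| \prod_{j=1}^{d} h^{\vee}_{\beta_j,0}(x_j)\,(1 + |x_j|^2)^l \cos(\omega_0^T x) \right|} < \infty,$$

ensures that $\theta$ satisfies (v) in Lemma 16. The existence of finite $\alpha$ is guaranteed as $h_{a,0} \in \mathscr{D}_1 \subset \mathscr{S}_1$, which implies $h^{\vee}_{a,0} \in \mathscr{S}_1$, $\forall a$. We conclude there exists a non-zero $\theta$ as claimed earlier, which completes the proof. ■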

To elucidate the necessity part in the above proof, in the following, we present a simple example that provides an intuitive understanding of the construction of $\theta$ such that, for a given $\mathbb{Q}$, $\mathbb{P} \neq \mathbb{Q}$ can be constructed with $\gamma_k(\mathbb{P}, \mathbb{Q}) = 0$.

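Example 4 Let $\mathbb{Q}$ be a Cauchy distribution in $\mathbb{R}$, i.e., $q(x) = \frac{1}{\pi(1 + x^2)}$ with characteristic function $\phi_{\mathbb{Q}}(\omega) = \frac{1}{\sqrt{2\pi}} e^{-|\omega|}$ in $L^1(\mathbb{R})$. Let $\psi$ be a Sinc kernel, i.e., $\psi(x) = \sqrt{\frac{2}{\pi}} \frac{\sin(\beta x)}{x}$ with Fourier transform given by $\hat{\psi}(\omega) = \mathbb{1}_{[-\beta,\beta]}(\omega)$ and $\mathrm{supp}(\hat{\psi}) = [-\beta, \beta] \subsetneq \mathbb{R}$. Let $\theta$ be

$$\theta(\omega) = \frac{\alpha}{2i} \left[ *_1^N \mathbb{1}_{[-\frac{\beta}{2},\frac{\beta}{2}]}(\omega) \right] * \left[ \delta(\omega - \omega_0) - \delta(\omega + \omega_0) \right], \tag{28}$$

where $|\omega_0| \geq \left( \frac{N+2}{2} \right) \beta$, $N \geq 2$ and $\alpha \neq 0$. $*_1^N$ represents the $N$-fold convolution. Note that $\theta$ is such that $\mathrm{supp}(\theta) \cap \mathrm{supp}(\hat{\psi})$ is a null set w.r.t. the Lebesgue measure, which satisfies (iii) in Lemma 16. It is easy to verify that $\theta \in L^1(\mathbb{R}) \cap L^2(\mathbb{R}) \cap C_b(\mathbb{R})$ also satisfies conditions (i) and (iv) in Lemma 16. $\theta^{\vee}$ can be computed as

$$\theta^{\vee}(x) = \frac{2^N \alpha}{\sqrt{2\pi}} \sin(\omega_0 x) \frac{\sin^N\left(\frac{\beta x}{2}\right)}{x^N}, \tag{29}$$

and $\theta^{\vee} \in L^1(\mathbb{R}) \cap L^2(\mathbb{R}) \cap C_b(\mathbb{R})$ satisfies (ii) in Lemma 16. Choose

$$0 < \alpha \leq \frac{\sqrt{2}}{\sqrt{\pi}\, \beta^N \sup_x \left| (1 + x^2) \sin(\omega_0 x)\, \mathrm{sinc}^N\left(\frac{\beta x}{2\pi}\right) \right|}, \tag{30}$$

where $\mathrm{sinc}(x) := \frac{\sin(\pi x)}{\pi x}$. Define $g(x) := \sin(\omega_0 x)\, \mathrm{sinc}^N\left(\frac{\beta x}{2\pi}\right)$. Since $N \geq 2$, $0 < \sup_x |(1 + x^2) g(x)| < \infty$ and, therefore, $\alpha$ is a finite non-zero number. It is easy to see that $\theta$ satisfies (v) of Lemma 16. Then, by Lemma 16, there exists $\mathbb{P} \neq \mathbb{Q}$, $\mathbb{P} \in \mathscr{P}_0$, given by

$$p(x) = \frac{1}{\pi(1 + x^2)} + \frac{2^N \alpha}{\sqrt{2\pi}} \sin(\omega_0 x) \frac{\sin^N\left(\frac{\beta x}{2}\right)}{x^N}, \tag{31}$$

with $\phi_{\mathbb{P}} = \phi_{\mathbb{Q}} + \theta = \phi_{\mathbb{Q}} + i\theta_I$, where $\theta_I = \mathrm{Im}[\theta]$ and $\phi_{\mathbb{P}} \in L^1(\mathbb{R})$. So, we have constructed $\mathbb{P} \neq \mathbb{Q}$ such that $\gamma_k(\mathbb{P}, \mathbb{Q}) = 0$. Figure 2 shows the plots of $\psi$, $\hat{\psi}$, $\theta$, $\theta^{\vee}$, $q$, $p$ and $|\phi_{\mathbb{P}}|$ for $\beta = 2\pi$, $N = 2$, $\omega_0 = 4\pi$ and $\alpha =$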

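We now prove Theorem 12.

Proof (Theorem 12) Suppose $\exists\, \mathbb{P} \neq \mathbb{Q}$, $\mathbb{P}, \mathbb{Q} \in \mathscr{P}_1$ such that $\gamma_k(\mathbb{P}, \mathbb{Q}) = 0$. Since any positive Borel measure on $\mathbb{R}^d$ is a distribution (Rudin, 1991, p. 157), $\mathbb{P}$ and $\mathbb{Q}$ can be treated as distributions with compact support. By the Paley-Wiener theorem (see Appendix A), $\phi_{\mathbb{P}}$ and $\phi_{\mathbb{Q}}$ are entire on $\mathbb{C}^d$. Let $\theta := \phi_{\mathbb{P}} - \phi_{\mathbb{Q}}$. Since $\gamma_k(\mathbb{P}, \mathbb{Q}) = 0$, we have from (16) that $\int_{\mathbb{R}^d} |\theta(\omega)|^2 \, d\Lambda(\omega) = 0$. From Remark 17, it follows that $\mathrm{supp}(\theta) \subset \mathrm{cl}(\mathbb{R}^d \setminus \mathrm{supp}(\Lambda))$. Since $\mathrm{supp}(\Lambda)$ has a non-empty interior, we have $\mathrm{supp}(\theta) \subsetneq \mathbb{R}^d$. Thus, there exists an open set $U \subset \mathbb{R}^d$ such that $\theta(x) = 0$, $\forall x \in U$. Therefore, by Lemma 30 (see Appendix A), $\theta = 0$, which means $\phi_{\mathbb{P}} = \phi_{\mathbb{Q}} \Rightarrow \mathbb{P} = \mathbb{Q}$, leading to a contradiction. So, there does not exist $\mathbb{P} \neq \mathbb{Q}$, $\mathbb{P}, \mathbb{Q} \in \mathscr{P}_1$ such that $\gamma_k(\mathbb{P}, \mathbb{Q}) = 0$, and $k$ is therefore characteristic to $\mathscr{P}_1$. ■

The condition that $\mathrm{supp}(\Lambda)$ has a non-empty interior is important for Theorem 12 to hold. In the following, we provide a simple example to show that $\mathbb{P} \neq \mathbb{Q}$, $\mathbb{P}, \mathbb{Q} \in \mathscr{P}_1$ can be constructed such that $\gamma_k(\mathbb{P}, \mathbb{Q}) = 0$, if $k$ is a periodic translation invariant kernel for which $\mathrm{int}(\mathrm{supp}(\Lambda)) = \emptyset$.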

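Example 5 Let $\mathbb{Q}$ be a uniform distribution on $[-\beta, \beta] \subset \mathbb{R}$, i.e., $q(x) = \frac{1}{2\beta} \mathbb{1}_{[-\beta,\beta]}(x)$ with its characteristic function, $\phi_{\mathbb{Q}}(\omega) = \frac{1}{\beta\sqrt{2\pi}} \frac{\sin(\beta\omega)}{\omega} \in L^2(\mathbb{R})$. Let $\psi$ be the Dirichlet kernel with period $\tau$, where $\tau \leq \beta$, i.e., $\psi(x) = \frac{\sin\left(\frac{(2l+1)\pi x}{\tau}\right)}{\sin\left(\frac{\pi x}{\tau}\right)}$ and $\hat{\psi}(\omega) = \sqrt{2\pi} \sum_{j=-l}^{l} \delta\left(\omega - \frac{2\pi j}{\tau}\right)$ with $\mathrm{supp}(\hat{\psi}) = \left\{ \frac{2\pi j}{\tau},\ j \in \{0, \pm 1, \ldots, \pm l\} \right\}$. Clearly, $\mathrm{supp}(\hat{\psi})$ has an empty interior. Let $\theta$ be

$$\theta(\omega) = \frac{8\sqrt{2}\,\alpha}{i\sqrt{\pi}} \sin\left(\frac{\omega\tau}{2}\right) \frac{\sin^2\left(\frac{\omega\tau}{4}\right)}{\tau\omega^2}, \tag{32}$$

with $0 < \alpha \leq \frac{1}{2\beta}$. It is easy to verify that $\theta \in L^1(\mathbb{R}) \cap L^2(\mathbb{R}) \cap C_b(\mathbb{R})$, so $\theta$ satisfies (i) in Lemma 16. Since $\theta(\omega) = 0$ at $\omega = \frac{2\pi l}{\tau}$, $l \in \mathbb{Z}$, $\mathrm{supp}(\theta) \cap \mathrm{supp}(\hat{\psi}) \subset \mathrm{supp}(\hat{\psi})$ is a set of Lebesgue measure zero, so (iii) and (iv) in Lemma 16 are satisfied. $\theta^{\vee}$ is given by

$$\theta^{\vee}(x) = \begin{cases} \frac{2\alpha |x + \frac{\tau}{2}|}{\tau} - \alpha, & -\tau \leq x \leq 0 \\ \alpha - \frac{2\alpha |x - \frac{\tau}{2}|}{\tau}, & 0 \leq x \leq \tau \\ 0, & \text{otherwise}, \end{cases} \tag{33}$$

where $\theta^{\vee} \in L^1(\mathbb{R}) \cap L^2(\mathbb{R}) \cap C_b(\mathbb{R})$ satisfies (ii) in Lemma 16. Now, consider $p = q + \theta^{\vee}$, which is given as

$$p(x) = \begin{cases} \frac{1}{2\beta}, & x \in [-\beta, -\tau] \cup [\tau, \beta] \\ \frac{2\alpha |x + \frac{\tau}{2}|}{\tau} + \frac{1}{2\beta} - \alpha, & x \in [-\tau, 0] \\ \alpha + \frac{1}{2\beta} - \frac{2\alpha |x - \frac{\tau}{2}|}{\tau}, & x \in [0, \tau] \\ 0, & \text{otherwise}. \end{cases} \tag{34}$$

Clearly, $p(x) \geq 0$, $\forall x$ and $\int_{\mathbb{R}} p(x)\,dx = 1$. $\phi_{\mathbb{P}} = \phi_{\mathbb{Q}} + \theta = \phi_{\mathbb{Q}} + i\theta_I$, where $\theta_I = \mathrm{Im}[\theta]$ and $\phi_{\mathbb{P}} \in L^2(\mathbb{R})$. We have therefore constructed $\mathbb{P} \neq \mathbb{Q}$ such that $\gamma_k(\mathbb{P}, \mathbb{Q}) = 0$, where $\mathbb{P}$ and $\mathbb{Q}$ are compactly supported in $\mathbb{R}$ with characteristic functions in $L^2(\mathbb{R})$, i.e., $\mathbb{P}, \mathbb{Q} \in \mathscr{P}_1$. Figure 3 shows the plots of $\psi$, $\hat{\psi}$, $\theta$, $\theta^{\vee}$, $q$, $\phi_{\mathbb{Q}}$, $p$ and $|\phi_{\mathbb{P}}|$ for $\tau = 2$, $l = 2$, $\beta = 3$ and $\alpha = \frac{1}{8}$.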
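The construction in Example 5 can be sanity-checked numerically. The following is a minimal sketch, not part of the original analysis: it evaluates the perturbation in (32) on the support of the kernel spectrum, and checks that the density in (34) is non-negative and integrates to one. The parameter values (in particular `ALPHA = 0.1`) and all function names are assumptions chosen for illustration; any $\tau \leq \beta$ and $0 < \alpha \leq 1/(2\beta)$ should behave the same way.

```python
import math

# Illustrative parameter values (assumptions of this sketch):
# any tau <= beta and 0 < alpha <= 1 / (2 * beta) should work.
TAU, ELL, BETA, ALPHA = 2.0, 2, 3.0, 0.1


def theta_imag(omega, alpha=ALPHA, tau=TAU):
    """Imaginary part of theta in (32); theta is purely imaginary."""
    if omega == 0.0:
        return 0.0  # limit of the expression as omega -> 0
    return -(8.0 * math.sqrt(2.0) * alpha / math.sqrt(math.pi)) * (
        math.sin(omega * tau / 2.0)
        * math.sin(omega * tau / 4.0) ** 2
        / (tau * omega ** 2)
    )


def theta_check(x, alpha=ALPHA, tau=TAU):
    """Inverse Fourier transform of theta, as given in (33)."""
    if -tau <= x <= 0.0:
        return 2.0 * alpha * abs(x + tau / 2.0) / tau - alpha
    if 0.0 < x <= tau:
        return alpha - 2.0 * alpha * abs(x - tau / 2.0) / tau
    return 0.0


def q(x, beta=BETA):
    """Uniform density on [-beta, beta]."""
    return 1.0 / (2.0 * beta) if -beta <= x <= beta else 0.0


def p(x):
    """Perturbed density p = q + theta_check, as in (34)."""
    return q(x) + theta_check(x)


def midpoint_integral(f, a, b, cells):
    """Midpoint rule; exact for piecewise-linear f with kinks on cell edges."""
    h = (b - a) / cells
    return h * sum(f(a + (i + 0.5) * h) for i in range(cells))


# theta should vanish on supp(psi_hat) = {2*pi*j/tau : |j| <= l}.
support_of_psi_hat = [2.0 * math.pi * j / TAU for j in range(-ELL, ELL + 1)]
max_theta_on_support = max(abs(theta_imag(w)) for w in support_of_psi_hat)

# p should be a valid density on [-beta, beta].
grid = [-BETA + (i + 0.5) * 1e-3 for i in range(6000)]
min_p = min(p(x) for x in grid)
total_mass = midpoint_integral(p, -BETA, BETA, 6000)
```

Because $\theta$ vanishes exactly where $\hat{\psi}$ puts its mass, the spectral integral in (16) sees no difference between the two distributions, even though `p` and `q` differ on $[-\tau, \tau]$.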

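We now present the proof of Theorem 14, which is similar to that of Theorem 9.

Proof (Theorem 14) ($\Leftarrow$) From (14), we have

$$\begin{aligned} \gamma_k^2(\mathbb{P}, \mathbb{Q}) &= \iint_{\mathbb{T}^d} \psi(x - y)\, d(\mathbb{P} - \mathbb{Q})(x)\, d(\mathbb{P} - \mathbb{Q})(y) \\ &\overset{(a)}{=} \iint_{\mathbb{T}^d} \sum_{n \in \mathbb{Z}^d} A_\psi(n)\, e^{i(x-y)^T n}\, d(\mathbb{P} - \mathbb{Q})(x)\, d(\mathbb{P} - \mathbb{Q})(y) \\ &\overset{(b)}{=} \sum_{n \in \mathbb{Z}^d} A_\psi(n) \left| \int_{\mathbb{T}^d} e^{-i x^T n}\, d(\mathbb{P} - \mathbb{Q})(x) \right|^2 \\ &\overset{(c)}{=} (2\pi)^{2d} \sum_{n \in \mathbb{Z}^d} A_\psi(n)\, |A_{\mathbb{P}}(n) - A_{\mathbb{Q}}(n)|^2, \end{aligned} \tag{35}$$

where we have invoked Bochner's theorem (Theorem 13) in (a), Fubini's theorem in (b) and

$$A_{\mathbb{P}}(n) := \frac{1}{(2\pi)^d} \int_{\mathbb{T}^d} e^{-i n^T x}\, d\mathbb{P}(x), \quad n \in \mathbb{Z}^d, \tag{36}$$

in (c). $A_{\mathbb{P}}$ is the Fourier transform of $\mathbb{P}$ in $\mathbb{T}^d$. Since $A_\psi(0) \geq 0$ and $A_\psi(n) > 0$, $\forall n \neq 0$, $\gamma_k(\mathbb{P}, \mathbb{Q}) = 0$ implies $A_{\mathbb{P}}(n) = A_{\mathbb{Q}}(n)$, $\forall n$. Therefore, by the uniqueness theorem of the Fourier transform, we have $\mathbb{P} = \mathbb{Q}$.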

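($\Rightarrow$) Proving the necessity is equivalent to proving that if the condition $A_\psi(0) \geq 0$, $A_\psi(n) > 0$, $\forall n \neq 0$ is violated, then $k$ is not characteristic, which is equivalent to showing that $\exists\, \mathbb{P} \neq \mathbb{Q}$ such that $\gamma_k(\mathbb{P}, \mathbb{Q}) = 0$. Let $\mathbb{Q}$ be a uniform probability measure with $q(x) = \frac{1}{(2\pi)^d}$, $\forall x \in \mathbb{T}^d$. Let $k$ be such that $A_\psi(n) = 0$ for some $n = n_0 \neq 0$. Define

$$A_{\mathbb{P}}(n) := \begin{cases} A_{\mathbb{Q}}(n), & n \neq \pm n_0 \\ A_{\mathbb{Q}}(n) + \theta(n), & n = \pm n_0, \end{cases} \tag{37}$$

where $A_{\mathbb{Q}}(n) = \frac{1}{(2\pi)^d} \delta_{0n}$ and $\theta(-n_0) = \overline{\theta(n_0)}$. So,

$$p(x) = \sum_{n \in \mathbb{Z}^d} A_{\mathbb{P}}(n)\, e^{i x^T n} = \frac{1}{(2\pi)^d} + \theta(n_0)\, e^{i x^T n_0} + \theta(-n_0)\, e^{-i x^T n_0}. \tag{38}$$

Choose $\theta(n_0) = i\alpha$, $\alpha \in \mathbb{R}$. Then, $p(x) = \frac{1}{(2\pi)^d} - 2\alpha \sin(x^T n_0)$. It is easy to check that $p$ integrates to one. Choosing $0 < |\alpha| \leq \frac{1}{2(2\pi)^d}$ ensures that $p(x) \geq 0$, $\forall x \in \mathbb{T}^d$, and that $\mathbb{P} \neq \mathbb{Q}$. By using $A_{\mathbb{P}}(n)$ in (35), it is clear that $\gamma_k(\mathbb{P}, \mathbb{Q}) = 0$. Therefore, $\exists\, \mathbb{P} \neq \mathbb{Q}$ such that $\gamma_k(\mathbb{P}, \mathbb{Q}) = 0$, which means $k$ is not characteristic. ■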
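The necessity argument can be illustrated numerically for $d = 1$. The sketch below is an illustration under stated assumptions, not the authors' code: it takes a hypothetical kernel on the circle whose Fourier coefficients are $A_\psi(n) = 1/(1+n^2)$ except at $n = \pm n_0$, where they vanish, builds $p(x) = \frac{1}{2\pi} - 2\alpha \sin(n_0 x)$, and evaluates the sum in (35), truncated to $|n| \leq 10$. The coefficient sequence, the values of `N0` and `ALPHA`, and the function names are all assumptions.

```python
import cmath
import math

N0 = 3        # frequency at which the kernel coefficient vanishes (assumed)
ALPHA = 0.05  # satisfies 0 < |alpha| <= 1 / (2 * (2 * pi)) for d = 1


def q(x):
    """Uniform density on the circle [0, 2*pi)."""
    return 1.0 / (2.0 * math.pi)


def p(x):
    """Perturbed density from (38) with theta(n0) = i * alpha."""
    return 1.0 / (2.0 * math.pi) - 2.0 * ALPHA * math.sin(N0 * x)


def fourier_coeff(density, n, grid=512):
    """A_P(n) from (36) for d = 1, by the rectangle rule on a periodic grid."""
    h = 2.0 * math.pi / grid
    s = sum(cmath.exp(-1j * n * k * h) * density(k * h) for k in range(grid))
    return s * h / (2.0 * math.pi)


def gamma_sq(a_psi, n_max=10):
    """Truncated version of the series in (35)."""
    return (2.0 * math.pi) ** 2 * sum(
        a_psi(n) * abs(fourier_coeff(p, n) - fourier_coeff(q, n)) ** 2
        for n in range(-n_max, n_max + 1)
    )


def a_psi_vanishing(n):
    """Hypothetical kernel spectrum with a zero at n = +/- N0."""
    return 0.0 if abs(n) == N0 else 1.0 / (1.0 + n * n)


def a_psi_positive(n):
    """Hypothetical kernel spectrum that is strictly positive everywhere."""
    return 1.0 / (1.0 + n * n)
```

With `a_psi_vanishing` the truncated sum is zero up to rounding although `p` and `q` differ, whereas `a_psi_positive` gives a strictly positive value, in line with the two directions of Theorem 14.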



4. Dissimilar Distributions with Small $\gamma_k$

So far, we have studied different characterizations for the kernel $k$ such that $\gamma_k$ is a metric on $\mathscr{P}$. As mentioned in Section 1, the metric property of $\gamma_k$ is crucial in many statistical inference applications like hypothesis testing. Therefore, in practice, it is important to use characteristic kernels. However, in this section, we show that characteristic kernels, while guaranteeing $\gamma_k$ to be a metric on $\mathscr{P}$, may nonetheless have difficulty in distinguishing certain distributions on the basis of finite samples. More specifically, in Theorem 19, we show that for a given kernel $k$ and for any $\varepsilon > 0$, there exist $\mathbb{P} \neq \mathbb{Q}$ such that $\gamma_k(\mathbb{P}, \mathbb{Q}) < \varepsilon$. Before proving the result, we motivate it through the following example.

Example 6 Let $\mathbb{P}$ be absolutely continuous w.r.t. the Lebesgue measure on $\mathbb{R}$ with the Radon-Nikodym derivative defined as

$$p(x) = q(x) + \alpha q(x) \sin(\nu\pi x), \tag{39}$$

where $q$ is the Radon-Nikodym derivative of $\mathbb{Q}$ w.r.t. the Lebesgue measure satisfying $q(x) = q(-x)$, $\forall x$ and $\alpha \in [-1, 1] \setminus \{0\}$, $\nu \in \mathbb{R} \setminus \{0\}$. It is obvious that $\mathbb{P} \neq \mathbb{Q}$. The characteristic function of $\mathbb{P}$ is given as

$$\phi_{\mathbb{P}}(\omega) = \phi_{\mathbb{Q}}(\omega) - \frac{i\alpha}{2} \left[ \phi_{\mathbb{Q}}(\omega - \nu\pi) - \phi_{\mathbb{Q}}(\omega + \nu\pi) \right], \quad \omega \in \mathbb{R}, \tag{40}$$

where $\phi_{\mathbb{Q}}$ is the characteristic function associated with $\mathbb{Q}$. Note that with increasing $|\nu|$, $p$ has higher frequency components in its Fourier spectrum and therefore appears more noisy, as shown in Figure 4. In Figure 4, (a-c) show the plots of $p$ when $q = U[-1, 1]$ (uniform distribution) and (a'-c') show the plots of $p$ when $q = N(0, 2)$ (zero mean normal distribution with variance 2) for $\nu = 0$, 2 and 7.5 with $\alpha = \frac{1}{2}$.

Consider the $B_1$-spline kernel on $\mathbb{R}$ given by $k(x, y) = \psi(x - y)$ where

$$\psi(x) = \begin{cases} 1 - |x|, & |x| \leq 1 \\ 0, & \text{otherwise}, \end{cases} \tag{41}$$

with its Fourier transform given by

$$\hat{\psi}(\omega) = \frac{2\sqrt{2}}{\sqrt{\pi}} \frac{\sin^2\left(\frac{\omega}{2}\right)}{\omega^2}. \tag{42}$$

Since $\psi$ is characteristic to $\mathscr{P}$, $\gamma_k(\mathbb{P}, \mathbb{Q}) > 0$ (see Theorem 9). However, it would be of interest to study the behavior of $\gamma_k(\mathbb{P}, \mathbb{Q})$ as a function of $\nu$. We study the behavior of $\gamma_k^2(\mathbb{P}, \mathbb{Q})$ through its unbiased, consistent estimator,$^{9}$ $\gamma^2_{k,u}(m, m)$, as considered by Gretton et al. (2007, Lemma 7).

Figure 5(a) shows the behavior of $\gamma^2_{k,u}(m, m)$ as a function of $\nu$ for $q = U[-1, 1]$ and $q = N(0, 2)$ using the $B_1$-spline kernel in (41). Since the Gaussian kernel, $k(x, y) = e^{-(x-y)^2}$, is also a characteristic kernel, its effect on the behavior of $\gamma^2_{k,u}(m, m)$ is shown in Figure 5(b) in comparison to that of the $B_1$-spline kernel.

In Figure 5, we observe two circumstances under which $\gamma_k^2$ may be small. First, $\gamma^2_{k,u}(m, m)$ decays with increasing $|\nu|$, and can be made as small as desired by choosing a sufficiently large $|\nu|$. Second, in Figure 5(a), $\gamma^2_{k,u}(m, m)$ has troughs at $\nu = \frac{\omega_0}{\pi}$ for $\omega_0 \in \{\omega : \hat{\psi}(\omega) = 0\}$. Since $\gamma^2_{k,u}(m, m)$ is a consistent estimate of $\gamma_k^2(\mathbb{P}, \mathbb{Q})$, one would expect similar behavior from $\gamma_k^2(\mathbb{P}, \mathbb{Q})$. This means that, although the $B_1$-spline kernel is characteristic to $\mathscr{P}$, in practice, it becomes harder to distinguish between $\mathbb{P}$ and $\mathbb{Q}$ with finite samples, when $\mathbb{P}$ is constructed as in (39) with $\nu = \frac{\omega_0}{\pi}$. In fact, one can observe from a straightforward spectral argument that the troughs in $\gamma_k^2(\mathbb{P}, \mathbb{Q})$ can be made arbitrarily deep by widening $q$, when $q$ is Gaussian.
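The experiment in Example 6 can be sketched in a few lines. The code below is an illustrative sketch rather than the procedure used to produce the figures: it implements the one-sample U-statistic with $h(Z_l, Z_j) = k(X_l, X_j) + k(Y_l, Y_j) - k(X_l, Y_j) - k(X_j, Y_l)$ for the $B_1$-spline kernel in (41), and draws from $p$ in (39) with $q = U[-1, 1]$ by rejection sampling. The function names, the rejection sampler, and the default sample size are assumptions.

```python
import math
import random


def b1_spline(x, y):
    """B_1-spline kernel k(x, y) = psi(x - y) with psi as in (41)."""
    return max(0.0, 1.0 - abs(x - y))


def mmd2_unbiased(xs, ys, kernel):
    """Unbiased U-statistic estimate of gamma_k^2 from paired samples."""
    m = len(xs)
    if m != len(ys) or m < 2:
        raise ValueError("need two samples of equal size m >= 2")
    total = 0.0
    for l in range(m):
        for j in range(m):
            if l != j:
                total += (
                    kernel(xs[l], xs[j])
                    + kernel(ys[l], ys[j])
                    - kernel(xs[l], ys[j])
                    - kernel(xs[j], ys[l])
                )
    return total / (m * (m - 1))


def sample_p(m, nu, alpha, rng):
    """Rejection sampling from p(x) = q(x) * (1 + alpha * sin(nu * pi * x)),
    with q = U[-1, 1] as proposal and envelope constant 1 + |alpha|."""
    out = []
    while len(out) < m:
        x = rng.uniform(-1.0, 1.0)
        accept_prob = (1.0 + alpha * math.sin(nu * math.pi * x)) / (1.0 + abs(alpha))
        if rng.random() <= accept_prob:
            out.append(x)
    return out


def estimate(nu, m=200, alpha=0.5, seed=0):
    """One draw of the estimator for a given perturbation frequency nu."""
    rng = random.Random(seed)
    xs = sample_p(m, nu, alpha, rng)
    ys = [rng.uniform(-1.0, 1.0) for _ in range(m)]
    return mmd2_unbiased(xs, ys, b1_spline)
```

Averaging `estimate(nu, seed=s)` over several seeds for a range of `nu` values gives a curve of the kind described for Figure 5(a). Since the estimator is unbiased rather than non-negative, individual draws can be slightly below zero.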

For characteristic kernels, although $\gamma_k(\mathbb{P}, \mathbb{Q}) > 0$ when $\mathbb{P} \neq \mathbb{Q}$, Example 6 demonstrates that one can construct distributions such that $\gamma^2_{k,u}(m, m)$ is indistinguishable from zero with high probability, for a given sample size $m$. Below, in Theorem 19, we explicitly construct $\mathbb{P} \neq \mathbb{Q}$ such that $|\mathbb{P}\varphi_l - \mathbb{Q}\varphi_l|$ is large for some large $l$, but $\gamma_k(\mathbb{P}, \mathbb{Q})$ is arbitrarily small, making it hard to detect a non-zero value of $\gamma_k(\mathbb{P}, \mathbb{Q})$ based on finite samples. Here, $\varphi_l \in L^2(M)$ represents the bounded orthonormal eigenfunctions of a positive definite integral operator associated with $k$. Based on this theorem, e.g., in Example 6, the decay mode of $\gamma_k$ for large $|\nu|$ can be investigated.

Consider the formulation of $\gamma_{\mathcal{F}}$ with $\mathcal{F} = \mathcal{F}_k$ in (1). The construction of $\mathbb{P}$ for a given $\mathbb{Q}$ such that $\gamma_k(\mathbb{P}, \mathbb{Q})$ is small, though not zero, can be intuitively understood by re-writing (1) as

$$\gamma_k(\mathbb{P}, \mathbb{Q}) = \sup_{f \in \mathcal{H}} \frac{|\mathbb{P}f - \mathbb{Q}f|}{\|f\|_{\mathcal{H}}}. \tag{43}$$

When $\mathbb{P} \neq \mathbb{Q}$, $|\mathbb{P}f - \mathbb{Q}f|$ can be large for some $f \in \mathcal{H}$. However, $\gamma_k(\mathbb{P}, \mathbb{Q})$ can be made small by selecting $\mathbb{P}$ such that the maximization of $\frac{|\mathbb{P}f - \mathbb{Q}f|}{\|f\|_{\mathcal{H}}}$ over $\mathcal{H}$ requires an $f$ with large $\|f\|_{\mathcal{H}}$. More specifically, higher order eigenfunctions of the kernel ($\varphi_l$ for large $l$) have large RKHS norms, so, if they are prominent in $\mathbb{P}$ and $\mathbb{Q}$ (i.e., highly non-smooth distributions), one can expect $\gamma_k(\mathbb{P}, \mathbb{Q})$ to be small even when there exists an $l$ for which $|\mathbb{P}\varphi_l - \mathbb{Q}\varphi_l|$ is large. To this end, we need the following lemma, which we quote from Gretton et al. (2004, Lemma 6).

9. Let $\{X_j\}_{j=1}^{m}$ and $\{Y_j\}_{j=1}^{m}$ be random samples drawn i.i.d. from $\mathbb{P}$ and $\mathbb{Q}$ respectively. An unbiased empirical estimate of $\gamma_k^2(\mathbb{P}, \mathbb{Q})$, denoted as $\gamma^2_{k,u}(m, m)$, is given by $\gamma^2_{k,u}(m, m) = \frac{1}{m(m-1)} \sum_{l \neq j}^{m} h(Z_l, Z_j)$, which is a one-sample U-statistic with $h(Z_l, Z_j) := k(X_l, X_j) + k(Y_l, Y_j) - k(X_l, Y_j) - k(X_j, Y_l)$, where $Z_1, \ldots, Z_m$ are $m$ i.i.d. random variables with $Z_j := (X_j, Y_j)$. See Gretton et al. (2007, Lemma 7) for details.

Figure 4: (a) $q = U[-1, 1]$, (a') $q = N(0, 2)$. (b-c) and (b'-c') denote $p(x)$ computed as $p(x) = q(x) + \frac{1}{2} q(x) \sin(\nu\pi x)$ with $q = U[-1, 1]$ and $q = N(0, 2)$ respectively. $\nu$ is chosen to be 2 in (b,b') and 7.5 in (c,c'). See Example 6 for details.

Figure 5: Behavior of the empirical estimate of $\gamma_k^2(\mathbb{P}, \mathbb{Q})$ w.r.t. $\nu$ for (a) the $B_1$-spline kernel and (b) the Gaussian kernel. $\mathbb{P}$ is constructed from $\mathbb{Q}$ as defined in (39). "Uniform" corresponds to $\mathbb{Q} = U[-1, 1]$ and "Gaussian" corresponds to $\mathbb{Q} = N(0, 2)$. $m = 1000$ samples are generated from $\mathbb{P}$ and $\mathbb{Q}$ to estimate $\gamma_k^2(\mathbb{P}, \mathbb{Q})$ through $\gamma^2_{k,u}(m, m)$. This is repeated 100 times and the average $\gamma^2_{k,u}(m, m)$ is plotted in both figures. Since the quantity of interest is the average behavior of $\gamma^2_{k,u}(m, m)$, we omit the error bars. See Example 6 for details.

Lemma 18 (Gretton et al. (2004)) Let $\mathcal{F}$ be the unit ball in an RKHS $(\mathcal{H}, k)$ defined on a compact topological space $M$, with $k$ being measurable. Let $\varphi_l \in L^2(M, \mu)$ be absolutely bounded orthonormal eigenfunctions and $\lambda_l$ be the corresponding eigenvalues (arranged in a decreasing order for increasing $l$) of a positive definite integral operator associated with $k$ and a $\sigma$-finite measure $\mu$. Assume $\lambda_l^{-1}$ increases superlinearly with $l$. Then, for $f \in \mathcal{F}$ where $f(x) = \sum_{j=1}^{\infty} \tilde{f}_j \varphi_j(x)$, $\tilde{f}_j := \langle f, \varphi_j \rangle_{L^2(M,\mu)}$, we have $\sum_{j=1}^{\infty} |\tilde{f}_j| < \infty$ and for every $\varepsilon > 0$, $\exists\, l_0 \in \mathbb{N}$ such that $|\tilde{f}_l| < \varepsilon$ if $l > l_0$.

Theorem 19 ($\mathbb{P} \neq \mathbb{Q}$ can have arbitrarily small $\gamma_k$) Assume the conditions in Lemma 18 hold. Then, there exist probability measures $\mathbb{P} \neq \mathbb{Q}$ defined on $M$ such that $\gamma_k(\mathbb{P}, \mathbb{Q}) < \varepsilon$ for any arbitrarily small $\varepsilon > 0$.

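Proof Let $q$ be the Radon-Nikodym derivative associated with $\mathbb{Q}$ w.r.t. the $\sigma$-finite measure $\mu$ (see Lemma 18). Let us construct $p(x) = q(x) + \alpha_l e(x) + \tau\varphi_l(x)$ where $e(x) = \mathbb{1}_M(x)$. For $\mathbb{P}$ to be a probability measure, the following conditions need to be satisfied:

$$\int_M [\alpha_l e(x) + \tau\varphi_l(x)]\, d\mu(x) = 0, \tag{44}$$

$$\min_{x \in M}\, [q(x) + \alpha_l e(x) + \tau\varphi_l(x)] \geq 0. \tag{45}$$

Expanding $e(x)$ and $f(x)$ in the orthonormal basis $\{\varphi_l\}_{l=1}^{\infty}$, we get $e(x) = \sum_{l=1}^{\infty} \tilde{e}_l \varphi_l(x)$ and $f(x) = \sum_{l=1}^{\infty} \tilde{f}_l \varphi_l(x)$, where $\tilde{e}_l := \langle e, \varphi_l \rangle_{L^2(M,\mu)}$ and $\tilde{f}_l := \langle f, \varphi_l \rangle_{L^2(M,\mu)}$. Therefore,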



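$$\begin{aligned} \mathbb{P}f - \mathbb{Q}f &= \int_M f(x)\, [\alpha_l e(x) + \tau\varphi_l(x)]\, d\mu(x) \\ &= \int_M \left[ \alpha_l \sum_{j=1}^{\infty} \tilde{e}_j \varphi_j(x) + \tau\varphi_l(x) \right] \left[ \sum_{t=1}^{\infty} \tilde{f}_t \varphi_t(x) \right] d\mu(x) \\ &= \alpha_l \sum_{j=1}^{\infty} \tilde{e}_j \tilde{f}_j + \tau \tilde{f}_l, \end{aligned} \tag{46}$$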



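where we used the fact that$^{10}$ $\langle \varphi_j, \varphi_t \rangle_{L^2(M,\mu)} = \delta_{jt}$. Rewriting (44) and substituting for $e(x)$ gives

$$\int_M [\alpha_l e(x) + \tau\varphi_l(x)]\, d\mu(x) = \int_M e(x)\, [\alpha_l e(x) + \tau\varphi_l(x)]\, d\mu(x) = \alpha_l \sum_{j=1}^{\infty} \tilde{e}_j^2 + \tau \tilde{e}_l = 0,$$

which implies

$$\alpha_l = -\frac{\tau \tilde{e}_l}{\sum_{j=1}^{\infty} \tilde{e}_j^2}. \tag{47}$$

Now, let us consider $\mathbb{P}\varphi_t - \mathbb{Q}\varphi_t = \alpha_l \tilde{e}_t + \tau\delta_{tl}$. Substituting for $\alpha_l$ gives

$$\mathbb{P}\varphi_t - \mathbb{Q}\varphi_t = \tau\delta_{tl} - \tau \frac{\tilde{e}_t \tilde{e}_l}{\sum_{j=1}^{\infty} \tilde{e}_j^2} = \tau\delta_{tl} - \tau\rho_{tl}, \tag{48}$$

where $\rho_{tl} := \frac{\tilde{e}_t \tilde{e}_l}{\sum_{j=1}^{\infty} \tilde{e}_j^2}$. By Lemma 18, $\sum_{l=1}^{\infty} |\tilde{e}_l| < \infty \Rightarrow \sum_{j=1}^{\infty} \tilde{e}_j^2 < \infty$, and choosing large enough $l$ gives $|\rho_{tl}| < \eta$, $\forall t$, for any arbitrary $\eta > 0$. Therefore, $|\mathbb{P}\varphi_t - \mathbb{Q}\varphi_t| > \tau - \eta$ for $t = l$ and $|\mathbb{P}\varphi_t - \mathbb{Q}\varphi_t| < \eta$ for $t \neq l$, which means $\mathbb{P} \neq \mathbb{Q}$. In the following, we prove that $\gamma_k(\mathbb{P}, \mathbb{Q})$ can be arbitrarily small, though non-zero.

10. Here, $\delta$ is used in the Kronecker sense.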

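Recall that $\gamma_k(\mathbb{P}, \mathbb{Q}) = \sup_{\|f\|_{\mathcal{H}} \leq 1} |\mathbb{P}f - \mathbb{Q}f|$. Substituting for $\alpha_l$ in (46) and replacing $|\mathbb{P}f - \mathbb{Q}f|$ by (46) in $\gamma_k(\mathbb{P}, \mathbb{Q})$, we have

$$\gamma_k(\mathbb{P}, \mathbb{Q}) = \sup_{\{\tilde{f}_j\}_{j=1}^{\infty}} \left\{ \tau \sum_{j=1}^{\infty} \nu_{jl} \tilde{f}_j \ : \ \sum_{j=1}^{\infty} \frac{\tilde{f}_j^2}{\lambda_j} \leq 1 \right\}, \tag{49}$$

where we used the definition of the RKHS norm as $\|f\|_{\mathcal{H}}^2 := \sum_{j=1}^{\infty} \frac{\tilde{f}_j^2}{\lambda_j}$ and $\nu_{jl} := \delta_{jl} - \rho_{jl}$. (49) is a convex quadratically constrained quadratic program in $\{\tilde{f}_j\}_{j=1}^{\infty}$. Solving the Lagrangian yields $\tilde{f}_j = \frac{\nu_{jl}\lambda_j}{\sqrt{\sum_{j=1}^{\infty} \nu_{jl}^2 \lambda_j}}$. Therefore,

$$\gamma_k(\mathbb{P}, \mathbb{Q}) = \tau \sqrt{\sum_{j=1}^{\infty} \nu_{jl}^2 \lambda_j} \ \longrightarrow\ 0 \quad \text{as } l \to \infty, \tag{50}$$

because (i) by choosing sufficiently large $l$, $|\rho_{jl}| < \varepsilon$, $\forall j$, for any arbitrary $\varepsilon > 0$, and (ii) $\lambda_l \to 0$ as $l \to \infty$ (Schölkopf and Smola, 2002, Theorem 2.10). Therefore, we have constructed $\mathbb{P} \neq \mathbb{Q}$ such that $\gamma_k(\mathbb{P}, \mathbb{Q}) < \varepsilon$ for any arbitrarily small $\varepsilon > 0$. ■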
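The step from (49) to (50) can also be checked without Lagrange multipliers. The following worked derivation is a sketch via the Cauchy-Schwarz inequality. It uses the shorthand $\nu_{jl} := \delta_{jl} - \rho_{jl}$ with $\rho_{jl} := \tilde{e}_j \tilde{e}_l / \sum_i \tilde{e}_i^2$ from the proof; the uniform bound $|\rho_{jl}| < \eta$ is the one obtained in the proof, while the summability $\sum_j \lambda_j < \infty$ used in the last line is an additional assumption of this sketch.

$$\tau \sum_{j=1}^{\infty} \nu_{jl} \tilde{f}_j = \tau \sum_{j=1}^{\infty} \left( \nu_{jl} \sqrt{\lambda_j} \right) \left( \frac{\tilde{f}_j}{\sqrt{\lambda_j}} \right) \leq \tau \left( \sum_{j=1}^{\infty} \nu_{jl}^2 \lambda_j \right)^{1/2} \left( \sum_{j=1}^{\infty} \frac{\tilde{f}_j^2}{\lambda_j} \right)^{1/2} \leq \tau \left( \sum_{j=1}^{\infty} \nu_{jl}^2 \lambda_j \right)^{1/2},$$

with equality at $\tilde{f}_j = \nu_{jl} \lambda_j \big/ \big( \sum_{i} \nu_{il}^2 \lambda_i \big)^{1/2}$, which satisfies the constraint with equality. Splitting off the term $j = l$ and using $|\rho_{jl}| < \eta$ for all $j$,

$$\sum_{j=1}^{\infty} \nu_{jl}^2 \lambda_j = (1 - \rho_{ll})^2 \lambda_l + \sum_{j \neq l} \rho_{jl}^2 \lambda_j \leq (1 + \eta)^2 \lambda_l + \eta^2 \sum_{j=1}^{\infty} \lambda_j .$$

The first term vanishes as $l \to \infty$ because $\lambda_l \to 0$, and the second can be made arbitrarily small through $\eta$, which is the behavior claimed in (50).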



5. Metrization of the Weak Topology 

So far, we have shown that a characteristic kernel $k$ induces a metric $\gamma_k$ on $\mathscr{P}$. As motivated in Section 1.1.3, an important question to consider that is useful both in theory and practice would be: "How strong or weak is $\gamma_k$ related to other metrics on $\mathscr{P}$?" This question is addressed in Theorem 21, wherein we compare $\gamma_k$ to other metrics on $\mathscr{P}$ like the Dudley metric ($\beta$), Wasserstein distance ($W$) and total variation distance ($TV$), and show that $\gamma_k$ is weaker than all these metrics (see footnote 3 for the definition of "strong" and "weak" metrics). Since $\gamma_k$ is weaker than the Dudley metric, which is well known to induce a topology on $\mathscr{P}$ that coincides with the standard topology on $\mathscr{P}$, called the weak-* (weak-star) topology (usually called the weak topology in probability theory), the next question we are interested in is to understand the topology that is being induced by $\gamma_k$. In particular, we are interested in determining the conditions on $k$ for which the topology induced by $\gamma_k$ coincides with the weak topology on $\mathscr{P}$. This is answered in Theorems 23 and 24, wherein Theorem 23 deals with compact $M$ and Theorem 24 provides a sufficient condition on $k$ when $M = \mathbb{R}^d$. The proofs of all these results are provided in Section 5.1. Before we motivate the need for this study and its implications, we present some preliminaries.

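The weak topology on $\mathscr{P}$ is the weakest topology such that the map $\mathbb{P} \mapsto \int_M f \, d\mathbb{P}$ is continuous for all $f \in C_b(M)$. For a metric space $(M, \rho)$, a sequence $\mathbb{P}_n$ of probability measures is said to converge weakly to $\mathbb{P}$, written as $\mathbb{P}_n \xrightarrow{w} \mathbb{P}$, if and only if $\int_M f \, d\mathbb{P}_n \to \int_M f \, d\mathbb{P}$ for every $f \in C_b(M)$. A metric $\gamma$ on $\mathscr{P}$ is said to metrize the weak topology if the topology induced by $\gamma$ coincides with the weak topology, which is defined as follows: if, for $\mathbb{P}, \mathbb{P}_1, \mathbb{P}_2, \ldots \in \mathscr{P}$, $\left( \mathbb{P}_n \xrightarrow{w} \mathbb{P} \Leftrightarrow \gamma(\mathbb{P}_n, \mathbb{P}) \xrightarrow{n \to \infty} 0 \right)$ holds, then the topology induced by $\gamma$ coincides with the weak topology.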

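In the following, we collect well-known results on the relation between various metrics on $\mathscr{P}$, which will be helpful to understand the behavior of these metrics in relation to others. Let $(M, \rho)$ be a separable metric space. The Prohorov metric on $(M, \rho)$, defined as

$$\varsigma(\mathbb{P}, \mathbb{Q}) := \inf\{\epsilon > 0 : \mathbb{P}(A) \leq \mathbb{Q}(A^{\epsilon}) + \epsilon,\ \forall\ \text{Borel sets } A\}, \tag{51}$$

metrizes the weak topology on $\mathscr{P}$ (Dudley, 2002, Theorem 11.3.3), where $\mathbb{P}, \mathbb{Q} \in \mathscr{P}$ and $A^{\epsilon} := \{y \in M : \rho(x, y) < \epsilon \text{ for some } x \in A\}$. Since the Dudley metric is related to the Prohorov metric as

$$\frac{1}{2}\beta(\mathbb{P}, \mathbb{Q}) \leq \varsigma(\mathbb{P}, \mathbb{Q}) \leq 2\sqrt{\beta(\mathbb{P}, \mathbb{Q})}, \tag{52}$$

it also metrizes the weak topology on $\mathscr{P}$ (Dudley, 2002, Theorem 11.3.3). The Wasserstein distance and total variation distance are related to the Prohorov metric as

$$\varsigma^2(\mathbb{P}, \mathbb{Q}) \leq W(\mathbb{P}, \mathbb{Q}) \leq (\mathrm{diam}(M) + 1)\, \varsigma(\mathbb{P}, \mathbb{Q}), \tag{53}$$

and

$$\varsigma(\mathbb{P}, \mathbb{Q}) \leq TV(\mathbb{P}, \mathbb{Q}), \tag{54}$$

where $\mathrm{diam}(M) := \sup\{\rho(x, y) : x, y \in M\}$ (Gibbs and Su, 2002, Theorem 2). This means $W$ and $TV$ are stronger than $\varsigma$, while $W$ and $\varsigma$ are equivalent (i.e., induce the same topology) when $M$ is bounded. By Theorem 4 in Gibbs and Su (2002), $TV$ and $W$ are related as

$$W(\mathbb{P}, \mathbb{Q}) \leq \mathrm{diam}(M)\, TV(\mathbb{P}, \mathbb{Q}), \tag{55}$$

which means $W$ and $TV$ are comparable if $M$ is bounded. See Shorack (2000, Chapter 19, Theorem 2.4) and Gibbs and Su (2002) for the relationship between various metrics on $\mathscr{P}$.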

Now, let us consider a sequence of probability measures on $\mathbb{R}$, $\mathbb{P}_n := \left(1 - \frac{1}{n}\right)\delta_0 + \frac{1}{n}\delta_n$, and let $\mathbb{P} := \delta_0$. It can be shown that $\beta(\mathbb{P}_n, \mathbb{P}) \to 0$ as $n \to \infty$, which means $\mathbb{P}_n \xrightarrow{w} \mathbb{P}$, while $W(\mathbb{P}_n, \mathbb{P}) = 1$ and $TV(\mathbb{P}_n, \mathbb{P}) = 1$ for all $n$. $\gamma_k(\mathbb{P}_n, \mathbb{P})$ can be computed as

$$\gamma_k^2(\mathbb{P}_n, \mathbb{P}) = \frac{1}{n^2} \iint_{\mathbb{R}} k(x, y)\, d(\delta_0 - \delta_n)(x)\, d(\delta_0 - \delta_n)(y) = \frac{k(0, 0) + k(n, n) - 2k(0, n)}{n^2}. \tag{56}$$

If $k$ is, e.g., a Gaussian, Laplacian or inverse multiquadratic kernel, then $\gamma_k(\mathbb{P}_n, \mathbb{P}) \to 0$ as $n \to \infty$. This example shows that $\gamma_k$ is weaker than $W$ and $TV$. It also shows that $\gamma_k$ behaves similarly to $\beta$ and leads to several questions we want to answer: Does $\gamma_k$ metrize the weak topology on $\mathscr{P}$? What is the general behavior of $\gamma_k$ compared to other metrics? In other words, depending on $k$, how weak or strong is $\gamma_k$ compared to other metrics on $\mathscr{P}$? Understanding the answer to these questions is important both in theory and practice. If $k$ is characterized such that $\gamma_k$ metrizes the weak topology on $\mathscr{P}$, then it can be used as a theoretical tool in probability theory, similar to the Prohorov and Dudley metrics. On the other hand, the answer to these questions is critical in applications as it will have a bearing on the choice of kernels to be used. In applications like density estimation, one would need a strong metric to ascertain that the density estimate is a good representation of the true underlying density. For this reason, usually, the total variation distance, Hellinger distance or Kullback-Leibler distance are used. Studying the relation of $\gamma_k$ to these metrics will provide an understanding about the choice of kernels to be used, depending on the application.
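The closed form in (56) is easy to evaluate. The following minimal sketch does so for a Gaussian kernel and contrasts it with the Wasserstein distance of the same pair of measures. The kernel bandwidth and the function names are assumptions of the sketch; the Wasserstein value uses the fact that, because the second measure is a point mass at 0, the only coupling moves mass $1/n$ from the point $n$ to the point $0$.

```python
import math


def gaussian_kernel(x, y):
    """Gaussian kernel exp(-(x - y)^2); the unit bandwidth is an assumption."""
    return math.exp(-((x - y) ** 2))


def gamma_k(n, kernel=gaussian_kernel):
    """gamma_k(P_n, P) from (56), with P_n = (1 - 1/n) delta_0 + (1/n) delta_n."""
    return math.sqrt(kernel(0.0, 0.0) + kernel(n, n) - 2.0 * kernel(0.0, n)) / n


def wasserstein(n):
    """W(P_n, delta_0): mass 1/n is transported over a distance n."""
    return (1.0 / n) * abs(n - 0.0)


# gamma_k shrinks roughly like sqrt(2) / n while the Wasserstein distance stays put.
values = [(n, gamma_k(n), wasserstein(n)) for n in (1, 10, 100, 1000)]
```

The list `values` makes the contrast in the text concrete: the second component decreases with `n`, the third does not.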

With the above motivation, in the following, we first compare $\gamma_k$ to $\beta$, $W$ and $TV$. Since $\beta$ is equivalent to $\varsigma$, we do not compare $\gamma_k$ to $\varsigma$. Before we provide the main result in Theorem 21 that compares $\gamma_k$ to other metrics, we present an upper bound on $\gamma_k$ in terms of the coupling formulation (Dudley, 2002, Section 11.8), which is not only useful in deriving the main result but also interesting in its own right.

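Proposition 20 (Coupling bound) Let $k$ be measurable and bounded on $M$. Then, for any $\mathbb{P}, \mathbb{Q} \in \mathscr{P}$,

$$\gamma_k(\mathbb{P}, \mathbb{Q}) \leq \inf_{\mu \in \mathcal{L}(\mathbb{P}, \mathbb{Q})} \iint_M \|k(\cdot, x) - k(\cdot, y)\|_{\mathcal{H}} \, d\mu(x, y), \tag{57}$$

where $\mathcal{L}(\mathbb{P}, \mathbb{Q})$ represents the set of all laws on $M \times M$ with marginals $\mathbb{P}$ and $\mathbb{Q}$.

Proof For any $\mu \in \mathcal{L}(\mathbb{P}, \mathbb{Q})$, we have

$$\begin{aligned} \left| \int_M f \, d(\mathbb{P} - \mathbb{Q}) \right| &= \left| \iint_M (f(x) - f(y)) \, d\mu(x, y) \right| \leq \iint_M |f(x) - f(y)| \, d\mu(x, y) \\ &= \iint_M |\langle f, k(\cdot, x) - k(\cdot, y) \rangle_{\mathcal{H}}| \, d\mu(x, y) \\ &\leq \|f\|_{\mathcal{H}} \iint_M \|k(\cdot, x) - k(\cdot, y)\|_{\mathcal{H}} \, d\mu(x, y). \end{aligned} \tag{58}$$

Taking the supremum over $f \in \mathcal{F}_k$ and the infimum over $\mu \in \mathcal{L}(\mathbb{P}, \mathbb{Q})$ in (58), where $\mathbb{P}, \mathbb{Q} \in \mathscr{P}$, gives the result in (57). ■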
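For finitely supported measures, both sides of the coupling bound can be computed directly. The sketch below is illustrative only: it evaluates $\gamma_k$ through the double sum $\sum_{i,j} (p_i - q_i)(p_j - q_j) k(x_i, x_j)$ and compares it with the cost of the independent coupling $\mathbb{P} \otimes \mathbb{Q}$, which has the required marginals and therefore gives an upper bound on the infimum in (57). The support points, the weights, the Gaussian kernel and the function names are assumptions.

```python
import math


def gaussian_kernel(x, y):
    """Gaussian kernel exp(-(x - y)^2); the unit bandwidth is an assumption."""
    return math.exp(-((x - y) ** 2))


def rho(x, y, kernel=gaussian_kernel):
    """RKHS distance ||k(., x) - k(., y)|| computed from kernel evaluations."""
    return math.sqrt(max(0.0, kernel(x, x) + kernel(y, y) - 2.0 * kernel(x, y)))


def gamma_k(points, p, q, kernel=gaussian_kernel):
    """gamma_k between two discrete measures supported on the same points."""
    d = [pi - qi for pi, qi in zip(p, q)]
    n = len(points)
    val = sum(
        d[i] * d[j] * kernel(points[i], points[j])
        for i in range(n)
        for j in range(n)
    )
    return math.sqrt(max(0.0, val))


def independent_coupling_cost(points, p, q, kernel=gaussian_kernel):
    """Integral of rho under the product coupling P x Q (one admissible law)."""
    n = len(points)
    return sum(
        p[i] * q[j] * rho(points[i], points[j], kernel)
        for i in range(n)
        for j in range(n)
    )


POINTS = [0.0, 0.5, 1.5, 3.0]   # hypothetical support
P = [0.4, 0.3, 0.2, 0.1]        # hypothetical weights of the first measure
Q = [0.1, 0.2, 0.3, 0.4]        # hypothetical weights of the second measure

lhs = gamma_k(POINTS, P, Q)
rhs = independent_coupling_cost(POINTS, P, Q)
```

The product coupling is generally not optimal (its cost is positive even when the two measures coincide), so `rhs` only illustrates the inequality; the infimum in (57) can be much smaller.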



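We now present the main result that compares $\gamma_k$ to $\beta$, $W$ and $TV$.

Theorem 21 (Comparison of $\gamma_k$ to $\beta$, $W$ and $TV$) Assume $\sup_{x \in M} k(x, x) \leq C < \infty$, where $k$ is measurable on $M$. Let

$$\rho(x, y) = \|k(\cdot, x) - k(\cdot, y)\|_{\mathcal{H}}. \tag{59}$$

Then, for any $\mathbb{P}, \mathbb{Q} \in \mathscr{P}$,

(i) $\gamma_k(\mathbb{P}, \mathbb{Q}) \leq W(\mathbb{P}, \mathbb{Q}) \leq \sqrt{\gamma_k^2(\mathbb{P}, \mathbb{Q}) + 4C}$ if $(M, \rho)$ is separable.

(ii) $\frac{\gamma_k(\mathbb{P}, \mathbb{Q})}{(1 + \sqrt{C})} \leq \beta(\mathbb{P}, \mathbb{Q}) \leq 2\left(\gamma_k^2(\mathbb{P}, \mathbb{Q}) + 4C\right)^{\frac{1}{4}}$ if $(M, \rho)$ is separable.

(iii) $\gamma_k(\mathbb{P}, \mathbb{Q}) \leq \sqrt{C}\; TV(\mathbb{P}, \mathbb{Q})$.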

The proof is provided in Section 15.11 Below are some remarks on Theorem [2TJ 

Remark 22 (a) First, note that, since k is bounded, (M,p) is a bounded metric space. In 
addition, the metr ic, p, which depends on the kernel as in \59\l. is a Hilbertian metri^\ 
ItBerg et ~al\ . 198^ . Chapter 3, Section 3) on M . A popular example of such a metric is 



p(x,y) = \\x — y||2, which can be obtained by choosing M to be a compact subset ofM, d and 
k(x, y) = x T y. 

(b) Theorem 21 shows that $\gamma_k$ is weaker than $\beta$, $W$ and $TV$ for the assumptions being made
on $k$ and $\rho$. Note that the result holds irrespective of whether the kernel is characteristic
or not, as we have not assumed anything about the kernel except it being measurable and
bounded. Also, it is important to remember that the result holds when $\rho$ is Hilbertian, as
mentioned in (59) (see Remark 22(d)).

(c) Apart from showing that $\gamma_k$ is weaker than $\beta$, $W$ and $TV$, the result in Theorem 21
can be used to bound these metrics in terms of $\gamma_k$. For $\beta$, which is primarily of theoretical
interest, we do not know a closed form expression, whereas a closed form expression to
compute $W$ is known only for $\mathbb{R}$ (Vallander, 1973).$^{12}$ Since $\gamma_k$ is easy to compute,
bounds on $W$ can be obtained from Theorem 21 in terms of $\gamma_k$. A closed form
expression for $TV$ is available if $\mathbb{P}$ and $\mathbb{Q}$ have Radon-Nikodym derivatives w.r.t. a $\sigma$-finite
measure. However, from Theorem 21, a simple lower bound can be obtained on $TV$ in terms
of $\gamma_k$ for any $\mathbb{P}, \mathbb{Q} \in \mathscr{P}$.

(d) In Theorem 21, the kernel is fixed and $\rho$ is defined as in (59), which is a Hilbertian
metric. On the other hand, suppose a Hilbertian metric $\rho$ is given. Then, the associated
kernel $k$ can be obtained from $\rho$ (Berg et al., 1984, Chapter 3, Lemma 2.1) as

$$k(x,y) = \frac{1}{2}\left[\rho^2(x,x_0) + \rho^2(y,x_0) - \rho^2(x,y)\right], \quad x, y, x_0 \in M, \tag{60}$$

which can then be used to compute $\gamma_k$.
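As a quick sanity check of (60), the following minimal sketch recovers the linear kernel $k(x,y) = x^T y$ from the Euclidean metric, matching the example in Remark 22(a). The function names `kernel_from_metric` and `euclidean`, and the choice of base point $x_0 = 0$, are illustrative assumptions and do not appear in the paper.

```python
import numpy as np

def kernel_from_metric(rho, x, y, x0):
    """Kernel associated with a Hilbertian metric rho via (60)."""
    return 0.5 * (rho(x, x0) ** 2 + rho(y, x0) ** 2 - rho(x, y) ** 2)

def euclidean(a, b):
    """rho(a, b) = ||a - b||_2."""
    return float(np.linalg.norm(a - b))

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=3)
x0 = np.zeros(3)  # assumed base point; with x0 = 0 the recovered kernel is x^T y

k_xy = kernel_from_metric(euclidean, x, y, x0)
linear = float(x @ y)
```

A different base point $x_0$ yields a different kernel, but one that induces the same metric $\rho$ through (59).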

The discussion so far has been devoted to relating $\gamma_k$ to $\beta$, $W$ and $TV$ to understand the
strength or weakness of $\gamma_k$ w.r.t. these metrics. In a next step, we address the other question
of when $\gamma_k$ metrizes the weak topology on $\mathscr{P}$. This question would have been answered
had the result in Theorem 21 shown that under some conditions on $k$, $\gamma_k$ is equivalent to $\beta$.
Since Theorem 21 does not throw light on the question we are interested in, we approach the
problem differently. In the following, we provide two results related to this question. The
first result states that when $(M,\rho)$ is compact, $\gamma_k$ induced by universal kernels metrizes the
weak topology. In the second result, we relax the assumption of compactness but restrict
ourselves to $M = \mathbb{R}^d$ and provide a sufficient condition on $k$ such that $\gamma_k$ metrizes the weak
topology on $\mathscr{P}$. The proofs of both theorems are provided in Section 5.1.



11. A metric $\rho$ on $M$ is said to be Hilbertian if there exists a Hilbert space $H$ and a mapping $\Phi$ such that
$\rho(x,y) = \|\Phi(x) - \Phi(y)\|_H$, $\forall x, y \in M$. In our case, $H = \mathcal{H}$ and $\Phi : M \to \mathcal{H}$, $x \mapsto k(\cdot,x)$.

12. The explicit form for the Wasserstein distance is known for $(M, \rho(x,y)) = (\mathbb{R}, |x - y|)$, which is given as
$W(\mathbb{P},\mathbb{Q}) = \int_{\mathbb{R}} |F_{\mathbb{P}}(x) - F_{\mathbb{Q}}(x)| \, dx$, where $F_{\mathbb{P}}(x) = \mathbb{P}((-\infty,x])$. It is easy to show that this explicit form
can be extended to $(\mathbb{R}^d, \|\cdot\|_1)$.
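The closed form in footnote 12 is straightforward to evaluate for empirical measures on the real line. The sketch below is one possible implementation under that assumption; the function name `wasserstein_1d` is hypothetical. Note that it uses the metric $|x - y|$ of footnote 12, not the kernel-induced metric $\rho$ of (59).

```python
import numpy as np

def wasserstein_1d(xs, ys):
    """W(P, Q) = integral of |F_P - F_Q| for empirical measures on the real line."""
    xs, ys = np.sort(xs), np.sort(ys)
    grid = np.sort(np.concatenate([xs, ys]))
    # Both empirical CDFs are constant on each interval [grid[i], grid[i+1])
    F_p = np.searchsorted(xs, grid[:-1], side="right") / xs.size
    F_q = np.searchsorted(ys, grid[:-1], side="right") / ys.size
    return float(np.sum(np.abs(F_p - F_q) * np.diff(grid)))

# Shifting every atom by 0.5 moves each unit of mass a distance of 0.5.
w_shift = wasserstein_1d(np.array([0.0, 1.0, 2.0]), np.array([0.5, 1.5, 2.5]))
w_same = wasserstein_1d(np.array([0.0, 1.0, 2.0]), np.array([0.0, 1.0, 2.0]))
```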






Theorem 23 (Weak convergence-I) Let $(M,\rho)$ be a compact metric space. If $k$ is universal,
then $\gamma_k$ metrizes the weak topology on $\mathscr{P}$.

From Theorem 23, it is clear that $\gamma_k$ is equivalent to $\varsigma$, $\beta$ and $W$ (see (52) and (53)) when
$M$ is compact and $k$ is universal.

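Theorem 24 (Weak convergence-II) Let $M = \mathbb{R}^d$ and $k(x,y) = \psi(x - y)$, where $\psi \in
C_0(\mathbb{R}^d) \cap L^1(\mathbb{R}^d)$ is a real-valued bounded strictly positive definite function. If there exists
an $l \in \mathbb{N}$ such that

$$\int_{\mathbb{R}^d} \frac{1}{\hat{\psi}(\omega)\,(1 + \|\omega\|_2)^{l}} \, d\omega < \infty, \tag{61}$$

then $\gamma_k$ metrizes the weak topology on $\mathscr{P}$.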

The entire Matérn class of kernels in (22) satisfies the conditions of Theorem 24 and,
therefore, the corresponding $\gamma_k$ metrizes the weak topology on $\mathscr{P}$. Note that Gaussian
kernels on $\mathbb{R}^d$ do not satisfy the condition in Theorem 24. The characterization of $k$ for
general non-compact domains $M$ (not necessarily $\mathbb{R}^d$), such that $\gamma_k$ metrizes the weak
topology on $\mathscr{P}$, still remains an open problem.
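To see why the two kernel families behave differently under (61), it helps to compare how fast their Fourier transforms decay. The sketch below is an informal calculation, not taken from the paper: the constants $c_1, c_2, a > 0$ and the Matérn smoothness parameter $\nu > 0$ are generic placeholders whose exact values depend on the parametrization and on the Fourier transform convention; only the decay rates matter.

```latex
% Gaussian: \psi(x) = e^{-\sigma \|x\|_2^2}, whose Fourier transform is again Gaussian
\hat{\psi}(\omega) = c_1\, e^{-\|\omega\|_2^2/(4\sigma)}
\;\Longrightarrow\;
\int_{\mathbb{R}^d} \frac{d\omega}{\hat{\psi}(\omega)\,(1+\|\omega\|_2)^{l}}
= \frac{1}{c_1} \int_{\mathbb{R}^d} \frac{e^{\|\omega\|_2^2/(4\sigma)}}{(1+\|\omega\|_2)^{l}}\, d\omega
= \infty \quad \text{for every } l \in \mathbb{N}.

% Matern with smoothness \nu: polynomially decaying Fourier transform
\hat{\psi}(\omega) = c_2\, (a^2 + \|\omega\|_2^2)^{-(\nu + d/2)}
\;\Longrightarrow\;
\frac{1}{\hat{\psi}(\omega)\,(1+\|\omega\|_2)^{l}}
= O\!\left(\|\omega\|_2^{\,2\nu + d - l}\right) \text{ as } \|\omega\|_2 \to \infty,
\text{ which is integrable on } \mathbb{R}^d \text{ whenever } l > 2\nu + 2d.
```

In other words, no polynomial weight can compensate for the exponential growth of $1/\hat{\psi}$ in the Gaussian case, whereas a sufficiently large $l$ always works for a polynomially decaying $\hat{\psi}$.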



5.1 Proofs 

We now present the proofs of Theorems 21, 23 and 24.

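Proof (Theorem 21) (i) When $(M,\rho)$ is separable, $W(\mathbb{P},\mathbb{Q})$ has a coupling formulation
(Dudley, 2002, p. 420), given as

$$W(\mathbb{P},\mathbb{Q}) = \inf_{\mu \in \mathcal{L}(\mathbb{P},\mathbb{Q})} \iint_M \rho(x,y) \, d\mu(x,y), \tag{62}$$

where $\mathbb{P}, \mathbb{Q} \in \{\mathbb{P} \in \mathscr{P} : \int_M \rho(x,y) \, d\mathbb{P}(y) < \infty, \ \forall x \in M\}$. In our case $\rho(x,y) = \|k(\cdot,x) -
k(\cdot,y)\|_{\mathcal{H}}$. In addition, $(M,\rho)$ is bounded, which means (62) holds for all $\mathbb{P}, \mathbb{Q} \in \mathscr{P}$. The
lower bound therefore follows from (57). The upper bound can be obtained as follows.
Consider $W(\mathbb{P},\mathbb{Q}) = \inf_{\mu \in \mathcal{L}(\mathbb{P},\mathbb{Q})} \iint_M \|k(\cdot,x) - k(\cdot,y)\|_{\mathcal{H}} \, d\mu(x,y)$, which can be bounded as

$$\begin{aligned}
W(\mathbb{P},\mathbb{Q}) &\le \iint_M \|k(\cdot,x) - k(\cdot,y)\|_{\mathcal{H}} \, d\mathbb{P}(x) \, d\mathbb{Q}(y) \\
&\overset{(a)}{\le} \left[ \iint_M \|k(\cdot,x) - k(\cdot,y)\|_{\mathcal{H}}^2 \, d\mathbb{P}(x) \, d\mathbb{Q}(y) \right]^{1/2} \\
&\le \left[ \int_M k(x,x) \, d(\mathbb{P} + \mathbb{Q})(x) - 2 \iint_M k(x,y) \, d\mathbb{P}(x) \, d\mathbb{Q}(y) \right]^{1/2} \\
&\le \left[ \gamma_k^2(\mathbb{P},\mathbb{Q}) + \iint_M (k(x,x) - k(x,y)) \, d(\mathbb{P} \otimes \mathbb{P} + \mathbb{Q} \otimes \mathbb{Q})(x,y) \right]^{1/2} \\
&\le \sqrt{\gamma_k^2(\mathbb{P},\mathbb{Q}) + 4C},
\end{aligned} \tag{63}$$

where we have used Jensen's inequality (Folland, 1999, p. 109) in (a).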
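(ii) Let $\mathcal{F} := \{f : \|f\|_{\mathcal{H}} < \infty\}$ and $\mathcal{G} := \{f : \|f\|_{BL} < \infty\}$. For $f \in \mathcal{F}$, we have

$$\begin{aligned}
\|f\|_{BL} &= \sup_{x \ne y} \frac{|f(x) - f(y)|}{\rho(x,y)} + \sup_{x \in M} |f(x)| = \sup_{x \ne y} \frac{|\langle f, k(\cdot,x) - k(\cdot,y) \rangle_{\mathcal{H}}|}{\|k(\cdot,x) - k(\cdot,y)\|_{\mathcal{H}}} + \sup_{x \in M} |\langle f, k(\cdot,x) \rangle_{\mathcal{H}}| \\
&\le (1 + \sqrt{C}) \|f\|_{\mathcal{H}} < \infty,
\end{aligned} \tag{64}$$

which implies $f \in \mathcal{G}$ and, therefore, $\mathcal{F} \subset \mathcal{G}$. For any $\mathbb{P}, \mathbb{Q} \in \mathscr{P}$,

$$\begin{aligned}
\gamma_k(\mathbb{P},\mathbb{Q}) &= \sup\{|\mathbb{P}f - \mathbb{Q}f| : \|f\|_{\mathcal{H}} \le 1\} \\
&\le \sup\{|\mathbb{P}f - \mathbb{Q}f| : \|f\|_{BL} \le (1 + \sqrt{C}), \ f \in \mathcal{F}\} \\
&\le \sup\{|\mathbb{P}f - \mathbb{Q}f| : \|f\|_{BL} \le (1 + \sqrt{C}), \ f \in \mathcal{G}\} \\
&= (1 + \sqrt{C}) \, \beta(\mathbb{P},\mathbb{Q}).
\end{aligned}$$

The upper bound is obtained as follows. For any $\mathbb{P}, \mathbb{Q} \in \mathscr{P}$, by Markov's inequality (Folland,
1999, Theorem 6.17), for all $\epsilon > 0$, we have

$$\epsilon^2 \, \mu\left(\|k(\cdot,X) - k(\cdot,Y)\|_{\mathcal{H}} > \epsilon\right) \le \iint_M \|k(\cdot,x) - k(\cdot,y)\|_{\mathcal{H}}^2 \, d\mu(x,y),$$

where $X$ and $Y$ are distributed as $\mathbb{P}$ and $\mathbb{Q}$ respectively. Choose $\epsilon$ such that $\epsilon^3 = \iint_M \|k(\cdot,x) -
k(\cdot,y)\|_{\mathcal{H}}^2 \, d\mu(x,y)$, so that $\mu(\|k(\cdot,X) - k(\cdot,Y)\|_{\mathcal{H}} > \epsilon) \le \epsilon$. From the proof of Theorem
11.3.5 of Dudley (2002), when $(M,\rho)$ is separable, we have

$$\mu(\rho(X,Y) > \epsilon) \le \epsilon \ \Longrightarrow \ \varsigma(\mathbb{P},\mathbb{Q}) \le \epsilon,$$

which implies that

$$\begin{aligned}
\varsigma(\mathbb{P},\mathbb{Q}) &\le \left( \inf_{\mu \in \mathcal{L}(\mathbb{P},\mathbb{Q})} \iint_M \|k(\cdot,x) - k(\cdot,y)\|_{\mathcal{H}}^2 \, d\mu(x,y) \right)^{1/3} \\
&\le \left( \iint_M \|k(\cdot,x) - k(\cdot,y)\|_{\mathcal{H}}^2 \, d\mathbb{P}(x) \, d\mathbb{Q}(y) \right)^{1/3} \\
&\overset{(b)}{\le} \left( \gamma_k^2(\mathbb{P},\mathbb{Q}) + 4C \right)^{1/3},
\end{aligned} \tag{65}$$

where (b) follows from (63). The result follows from (52).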

(iii) The proof of this result was presented in Sriperumbudur et al. (2009b) and is provided
here for completeness. To prove the result, we use (57) and the coupling formulation for
$TV$ (Lindvall, 1992, p. 19), given as

$$\frac{1}{2} TV(\mathbb{P},\mathbb{Q}) = \inf_{\mu \in \mathcal{L}(\mathbb{P},\mathbb{Q})} \mu(X \ne Y), \tag{66}$$

where $\mathcal{L}(\mathbb{P},\mathbb{Q})$ is the set of all measures on $M \times M$ with marginals $\mathbb{P}$ and $\mathbb{Q}$. Here, $X$ and
$Y$ are distributed as $\mathbb{P}$ and $\mathbb{Q}$ respectively. Consider

$$\|k(\cdot,x) - k(\cdot,y)\|_{\mathcal{H}} \le \mathbb{1}_{\{x \ne y\}} \|k(\cdot,x) - k(\cdot,y)\|_{\mathcal{H}} \le 2\sqrt{C} \, \mathbb{1}_{\{x \ne y\}}. \tag{67}$$

Taking expectations w.r.t. $\mu$ and the infimum over $\mu \in \mathcal{L}(\mathbb{P},\mathbb{Q})$ on both sides of (67) gives
the desired result, which follows from (57).
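The inequalities established above can be checked numerically for measures supported on finitely many points. The following sketch is an illustration only: the Gaussian kernel, the random support points and all variable names are assumptions made here. It verifies that $\gamma_k$ is bounded by the independent-coupling integral (a valid coupling in (57)), that this integral is bounded by $\sqrt{\gamma_k^2 + 4C}$ as in (63), and that $\gamma_k \le \sqrt{C}\, TV$ with $TV$ normalized as in (66).

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.normal(size=(6, 2))            # common support points x_1..x_6 in R^2
p = rng.dirichlet(np.ones(6))            # probabilities defining P
q = rng.dirichlet(np.ones(6))            # probabilities defining Q

# Gaussian kernel Gram matrix; k(x, x) = 1 for all x, so C = 1 here.
sq_dists = np.sum((pts[:, None, :] - pts[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq_dists)
diag = np.diag(K)
C = float(diag.max())

# gamma_k(P, Q) = ||Pk - Qk||_H, which is sqrt((p - q)^T K (p - q)) for discrete measures
d = p - q
gamma = float(np.sqrt(max(d @ K @ d, 0.0)))

# rho(x_i, x_j) = ||k(., x_i) - k(., x_j)||_H as in (59)
rho = np.sqrt(np.maximum(diag[:, None] + diag[None, :] - 2.0 * K, 0.0))

# Integral of rho under the independent coupling mu = P x Q (first line of (63))
indep = float(p @ rho @ q)

# Right-hand side of (63) and total variation with (1/2) TV = inf mu(X != Y)
upper_63 = float(np.sqrt(gamma ** 2 + 4.0 * C))
tv = float(np.sum(np.abs(d)))
```

The Wasserstein distance itself lies between `gamma` and `indep`; computing it exactly would require solving an optimal transport problem, which this sketch deliberately avoids.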



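Proof (Theorem 23) We need to show that for measures $\mathbb{P}, \mathbb{P}_1, \mathbb{P}_2, \ldots \in \mathscr{P}$, $\mathbb{P}_n \xrightarrow{w} \mathbb{P}$ if and
only if $\gamma_k(\mathbb{P}_n,\mathbb{P}) \to 0$ as $n \to \infty$. One direction is trivial as $\mathbb{P}_n \xrightarrow{w} \mathbb{P}$ implies $\gamma_k(\mathbb{P}_n,\mathbb{P}) \to 0$
as $n \to \infty$. We prove the other direction as follows. Since $k$ is universal, $\mathcal{H}$ is dense in
$C_b(M)$, the space of bounded continuous functions, w.r.t. the uniform norm, i.e., for any
$f \in C_b(M)$ and every $\epsilon > 0$, there exists a $g \in \mathcal{H}$ such that $\|f - g\|_\infty \le \epsilon$. Therefore,

$$\begin{aligned}
|\mathbb{P}_n f - \mathbb{P} f| &= |\mathbb{P}_n(f - g) + \mathbb{P}(g - f) + (\mathbb{P}_n g - \mathbb{P} g)| \\
&\le \mathbb{P}_n |f - g| + \mathbb{P} |f - g| + |\mathbb{P}_n g - \mathbb{P} g| \\
&\le 2\epsilon + |\mathbb{P}_n g - \mathbb{P} g| \le 2\epsilon + \|g\|_{\mathcal{H}} \, \gamma_k(\mathbb{P}_n,\mathbb{P}).
\end{aligned} \tag{68}$$

Since $\gamma_k(\mathbb{P}_n,\mathbb{P}) \to 0$ as $n \to \infty$ and $\epsilon$ is arbitrary, $|\mathbb{P}_n f - \mathbb{P} f| \to 0$ for any $f \in C_b(M)$. ■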

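Proof (Theorem 24) As mentioned in the proof of Theorem 23, one direction of the
proof is straightforward: $\mathbb{P}_n \xrightarrow{w} \mathbb{P} \Rightarrow \gamma_k(\mathbb{P}_n,\mathbb{P}) \to 0$ as $n \to \infty$. Let us consider the other
direction. Since $\psi$ is a strictly positive definite function, any $f \in \mathcal{H}$
satisfies (Wendland, 2005, Theorem 10.12)

$$\int_{\mathbb{R}^d} \frac{|\hat{f}(\omega)|^2}{\hat{\psi}(\omega)} \, d\omega < \infty. \tag{69}$$

Assume that

$$\sup_{\omega \in \mathbb{R}^d} (1 + \|\omega\|_2)^{l} \, |\hat{f}(\omega)|^2 < \infty, \tag{70}$$

for any $l \in \mathbb{N}$, which means $f \in \mathscr{S}_d$. Let (61) be satisfied for some $l = l_0$. Then,

$$\begin{aligned}
\int_{\mathbb{R}^d} \frac{|\hat{f}(\omega)|^2}{\hat{\psi}(\omega)} \, d\omega &= \int_{\mathbb{R}^d} \frac{|\hat{f}(\omega)|^2 (1 + \|\omega\|_2)^{l_0}}{\hat{\psi}(\omega)(1 + \|\omega\|_2)^{l_0}} \, d\omega \\
&\le \sup_{\omega \in \mathbb{R}^d} (1 + \|\omega\|_2)^{l_0} |\hat{f}(\omega)|^2 \int_{\mathbb{R}^d} \frac{1}{\hat{\psi}(\omega)(1 + \|\omega\|_2)^{l_0}} \, d\omega < \infty,
\end{aligned}$$

which means $f \in \mathcal{H}$, i.e., if $f \in \mathscr{S}_d$, then $f \in \mathcal{H}$, which implies $\mathscr{S}_d \subset \mathcal{H}$. Note that $\mathscr{S}_d$
is dense in $C_0(\mathbb{R}^d)$. Since $\psi \in C_0(\mathbb{R}^d)$, we have $\mathcal{H} \subset C_0(\mathbb{R}^d)$ and, therefore, $\mathcal{H}$ is dense in
$C_0(\mathbb{R}^d)$ w.r.t. the uniform norm. Suppose $\mathbb{P}, \mathbb{P}_1, \mathbb{P}_2, \ldots \in \mathscr{P}$. Using a similar analysis as in
the proof of Theorem 23, it can be shown that for any $f \in C_0(\mathbb{R}^d)$ and every $\epsilon > 0$, there
exists a $g \in \mathcal{H}$ such that $|\mathbb{P}_n f - \mathbb{P} f| \le 2\epsilon + |\mathbb{P}_n g - \mathbb{P} g|$. Since $\epsilon$ is arbitrary and $\gamma_k(\mathbb{P}_n,\mathbb{P}) \to 0$
as $n \to \infty$, the result follows. ■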



6. Conclusion and Discussion 

In this paper, we have studied various properties associated with a pseudometric $\gamma_k$ on
$\mathscr{P}$, which is based on the Hilbert space embedding of probability measures. First, we
studied the conditions on the kernel (called the characteristic kernel) under which $\gamma_k$ is a
metric and showed that, apart from universal kernels, a large family of bounded continuous
kernels induce a metric on $\mathscr{P}$: (a) integrally strictly pd kernels and (b) translation invariant
kernels on $\mathbb{R}^d$ and $\mathbb{T}^d$ that have the support of their Fourier transform to be $\mathbb{R}^d$ and $\mathbb{Z}^d$
respectively. Next, we showed that there exist distinct distributions which will be considered
close according to $\gamma_k$ (whether or not the kernel is characteristic), and thus may be hard
to distinguish based on finite samples. Finally, we compared $\gamma_k$ to other metrics on $\mathscr{P}$ and
explicitly presented the conditions under which it induces a weak topology on $\mathscr{P}$. These
results together provide a strong theoretical foundation for using the $\gamma_k$ metric in both
statistics and machine learning applications.

Now, we discuss two topics related to $\gamma_k$, one about the choice of kernel parameter and
the other about kernels defined on $\mathscr{P}$.

An important question that we did not discuss in this paper is how to choose a characteristic
kernel. Let us consider the following setting: $M = \mathbb{R}^d$ and $k_\sigma(x,y) = \exp(-\sigma \|x -
y\|_2^2)$, $\sigma \in \mathbb{R}_+$, a Gaussian kernel with $\sigma$ as the bandwidth parameter. $\{k_\sigma : \sigma \in \mathbb{R}_+\}$ is the
family of Gaussian kernels and $\{\gamma_{k_\sigma} : \sigma \in \mathbb{R}_+\}$ is the associated family of distance measures
indexed by the kernel parameter $\sigma$. Note that $k_\sigma$ is characteristic for any $\sigma \in \mathbb{R}_{++}$ and,
therefore, $\gamma_{k_\sigma}$ is a metric on $\mathscr{P}$ for any $\sigma \in \mathbb{R}_{++}$. In practice, one would prefer a single
number that defines the distance between $\mathbb{P}$ and $\mathbb{Q}$. The question therefore to be addressed
is how to choose an appropriate $\sigma$. Note that as $\sigma \to 0$, $k_\sigma \to 1$ and as $\sigma \to \infty$, $k_\sigma \to 0$ a.e.,
which means $\gamma_{k_\sigma}(\mathbb{P},\mathbb{Q}) \to 0$ as $\sigma \to 0$ or $\sigma \to \infty$ for all $\mathbb{P}, \mathbb{Q} \in \mathscr{P}$. This behavior is also
exhibited by $k_\sigma(x,y) = \exp(-\sigma \|x - y\|_1)$, $\sigma > 0$ and $k_\sigma(x,y) = \sigma^2 / (\sigma^2 + \|x - y\|_2^2)$, $\sigma > 0$,
which are also characteristic. This means choosing sufficiently small or sufficiently large $\sigma$
(depending on $\mathbb{P}$ and $\mathbb{Q}$) makes $\gamma_{k_\sigma}(\mathbb{P},\mathbb{Q})$ arbitrarily small. Therefore, $\sigma$ must be chosen
appropriately in applications to effectively distinguish between $\mathbb{P}$ and $\mathbb{Q}$.
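The dependence on the bandwidth can be made concrete with a case where $\gamma_{k_\sigma}$ has a closed form. The sketch below assumes, purely for illustration, that $\mathbb{P} = N(0, s^2)$ and $\mathbb{Q} = N(\delta, s^2)$ on $\mathbb{R}$; it uses the standard Gaussian integral $E\, e^{-\sigma Z^2} = (1 + 2\sigma v)^{-1/2} \exp(-\sigma m^2 / (1 + 2\sigma v))$ for $Z \sim N(m, v)$. The function name `gamma_gaussian` and the parameter values are hypothetical.

```python
import numpy as np

def gamma_gaussian(sigma, delta=1.0, s=1.0):
    """gamma_{k_sigma}(P, Q) for P = N(0, s^2), Q = N(delta, s^2) on R,
    with k_sigma(x, y) = exp(-sigma * (x - y)^2)."""
    scale = 1.0 + 4.0 * sigma * s ** 2
    # E k(X, X') = E k(Y, Y') = scale^{-1/2}
    # E k(X, Y)  = scale^{-1/2} * exp(-sigma * delta^2 / scale)
    # gamma^2 = E k(X, X') + E k(Y, Y') - 2 E k(X, Y); expm1 keeps precision for tiny sigma
    gamma_sq = 2.0 / np.sqrt(scale) * (-np.expm1(-sigma * delta ** 2 / scale))
    return float(np.sqrt(gamma_sq))

sigmas = [1e-6, 1e-2, 0.5, 1e2, 1e6]
values = [gamma_gaussian(sig) for sig in sigmas]
```

Under these assumptions the distance is largest at an intermediate bandwidth and decays towards zero at both extremes, even though $\mathbb{P} \ne \mathbb{Q}$ and every $k_\sigma$ is characteristic.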

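To this end, one can consider the following modification to $\gamma_k$, which yields a pseudometric
on $\mathscr{P}$:

$$\gamma(\mathbb{P},\mathbb{Q}) = \sup\{\gamma_k(\mathbb{P},\mathbb{Q}) : k \in \mathcal{K}\} = \sup\{\|\mathbb{P}k - \mathbb{Q}k\|_{\mathcal{H}} : k \in \mathcal{K}\}. \tag{71}$$

Note that $\gamma$ is the maximal RKHS distance between $\mathbb{P}$ and $\mathbb{Q}$ over a family $\mathcal{K}$ of measurable
and bounded positive definite kernels. It is easy to check that, if any $k \in \mathcal{K}$ is characteristic,
then $\gamma$ is a metric on $\mathscr{P}$. Examples for $\mathcal{K}$ include:

1. $\mathcal{K}_g := \{e^{-\sigma \|x - y\|_2^2}, \ x, y \in \mathbb{R}^d : \sigma \in \mathbb{R}_+\}$.

2. $\mathcal{K}_l := \{e^{-\sigma \|x - y\|_1}, \ x, y \in \mathbb{R}^d : \sigma \in \mathbb{R}_+\}$.

3. $\mathcal{K}_\psi := \{e^{-\sigma \psi(x,y)}, \ x, y \in M : \sigma \in \mathbb{R}_+\}$, where $\psi : M \times M \to \mathbb{R}$ is a negative definite
kernel.

4. $\mathcal{K}_{rbf} := \{\int_0^\infty e^{-\lambda \|x - y\|_2^2} \, d\mu_\sigma(\lambda), \ x, y \in \mathbb{R}^d, \ \mu_\sigma \in \mathcal{M}_+ : \sigma \in \Sigma \subset \mathbb{R}^d\}$, where $\mathcal{M}_+$ is the
set of all finite nonnegative Borel measures $\mu_\sigma$ on $\mathbb{R}_+$ that are not concentrated at
zero, etc.

5. $\mathcal{K}_{lin} := \{k_\lambda = \sum_{j=1}^{l} \lambda_j k_j \mid k_\lambda \text{ is pd}, \ \sum_{j=1}^{l} \lambda_j = 1\}$, which is the linear combination of
pd kernels $\{k_j\}_{j=1}^{l}$.

6. $\mathcal{K}_{con} := \{k_\lambda = \sum_{j=1}^{l} \lambda_j k_j \mid \lambda_j \ge 0, \ \sum_{j=1}^{l} \lambda_j = 1\}$, which is the convex combination of
pd kernels $\{k_j\}_{j=1}^{l}$.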
The idea and validity behind the proposal of $\gamma$ in (71) can be understood from a Bayesian
perspective, where we define a non-negative finite measure $\lambda$ over $\mathcal{K}$, and average $\gamma_k$ over
that measure, i.e., $\alpha(\mathbb{P},\mathbb{Q}) := \int_{\mathcal{K}} \gamma_k(\mathbb{P},\mathbb{Q}) \, d\lambda(k)$. This also yields a pseudometric on $\mathscr{P}$.
That said, $\alpha(\mathbb{P},\mathbb{Q}) \le \lambda(\mathcal{K}) \gamma(\mathbb{P},\mathbb{Q})$, $\forall \, \mathbb{P}, \mathbb{Q}$, which means that, if $\mathbb{P}$ and $\mathbb{Q}$ can be distinguished
by $\alpha$, then they can be distinguished by $\gamma$, but not vice-versa. In this sense, $\gamma$ is stronger
than $\alpha$ and therefore studying $\gamma$ makes sense. One further complication with the Bayesian
approach is in defining a sensible $\lambda$ over $\mathcal{K}$. Note that $\gamma_{k_0}$ can be obtained by defining
$\lambda(k) = \delta(k - k_0)$ in $\alpha(\mathbb{P},\mathbb{Q})$. Future work will include analyzing $\gamma$ and investigating its utility
in applications compared to that of $\gamma_k$ (with a fixed kernel $k$). Refer to Sriperumbudur et al.
(2009a) for some preliminary work, wherein we showed that $\gamma(\mathbb{P}_m, \mathbb{Q}_n)$ is a $\sqrt{mn/(m + n)}$-consistent
estimator of $\gamma(\mathbb{P},\mathbb{Q})$, for the classes $\mathcal{K}$ of kernels listed above.
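The relation $\alpha \le \lambda(\mathcal{K})\gamma$ can be illustrated on samples. The sketch below is a rough illustration under several assumptions that are not part of the paper: the supremum in (71) is approximated by a maximum over a finite bandwidth grid for the Gaussian family, $\gamma_k$ is replaced by its plug-in (biased) sample estimate, and $\lambda$ is a hypothetical measure placing equal mass on each grid point. All names are illustrative.

```python
import numpy as np

def mmd_gaussian(X, Y, sigma):
    """Plug-in (biased) estimate of gamma_k for k(x, y) = exp(-sigma ||x - y||_2^2)."""
    def gram(A, B):
        sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
        return np.exp(-sigma * sq)
    val = gram(X, X).mean() + gram(Y, Y).mean() - 2.0 * gram(X, Y).mean()
    return float(np.sqrt(max(val, 0.0)))

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 2))    # sample from P
Y = rng.normal(0.5, 1.0, size=(200, 2))    # sample from Q

sigma_grid = np.logspace(-3, 3, 25)        # finite stand-in for the Gaussian family
gammas = np.array([mmd_gaussian(X, Y, s) for s in sigma_grid])

gamma_sup = float(gammas.max())            # grid approximation of the supremum in (71)
weights = np.full(sigma_grid.size, 0.2)    # hypothetical measure lambda on the grid
alpha = float(weights @ gammas)            # averaged distance alpha(P, Q)
```

A finite grid only gives a lower bound on the true supremum, so in practice the grid range and resolution would have to be chosen with the data scale in mind.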



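We now discuss how kernels on $\mathscr{P}$ can be obtained from $\gamma_k$. As discussed in the paper,
$\gamma_k$ is a Hilbertian metric on $\mathscr{P}$. Therefore, using (60), the associated kernel can be easily
computed as

$$K(\mathbb{P},\mathbb{Q}) = \left\langle \int_M k(\cdot,x) \, d\mathbb{P}(x), \int_M k(\cdot,x) \, d\mathbb{Q}(x) \right\rangle_{\mathcal{H}} = \iint_M k(x,y) \, d\mathbb{P}(x) \, d\mathbb{Q}(y),$$

where $K : \mathscr{P} \times \mathscr{P} \to \mathbb{R}$ is a positive definite kernel, which can be seen as the dot-product
kernel on $\mathscr{P}$. Using the results in Berg et al. (1984, Chapter 3, Theorems 2.2 and 2.3),
Gaussian and inverse multi-quadratic kernels on $\mathscr{P}$ can be defined as

$$K(\mathbb{P},\mathbb{Q}) = \exp\left(-\sigma \gamma_k^2(\mathbb{P},\mathbb{Q})\right), \ \sigma > 0 \quad \text{and} \quad K(\mathbb{P},\mathbb{Q}) = \left(\sigma + \gamma_k^2(\mathbb{P},\mathbb{Q})\right)^{-1}, \ \sigma > 0,$$

respectively. Broadly, this relates to the work on Hilbertian metrics and positive definite
kernels on probability measures by Hein and Bousquet (2005) and Fuglede and Topsøe
(2003).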
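For empirical measures, the kernels on $\mathscr{P}$ described above reduce to averages of the base kernel. The following sketch assumes a Gaussian base kernel and Gaussian samples purely for illustration; the function names and the parameter `tau` are hypothetical. It builds Gram matrices over a few sample sets so that positive semidefiniteness can be inspected.

```python
import numpy as np

def base_kernel(A, B, tau=0.5):
    """Base kernel k(x, y) = exp(-tau ||x - y||_2^2) between rows of A and B."""
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-tau * sq)

def dot_product_kernel(X, Y):
    """K(P, Q) = double integral of k(x, y) dP(x) dQ(y) for empirical P and Q."""
    return float(base_kernel(X, Y).mean())

def gaussian_kernel_on_measures(X, Y, sigma=1.0):
    """K(P, Q) = exp(-sigma * gamma_k(P, Q)^2) for empirical P and Q."""
    gamma_sq = (dot_product_kernel(X, X) + dot_product_kernel(Y, Y)
                - 2.0 * dot_product_kernel(X, Y))
    return float(np.exp(-sigma * max(gamma_sq, 0.0)))

rng = np.random.default_rng(0)
samples = [rng.normal(loc=m, size=(50, 2)) for m in (0.0, 0.5, 1.0, 2.0)]

G_dot = np.array([[dot_product_kernel(a, b) for b in samples] for a in samples])
G_gauss = np.array([[gaussian_kernel_on_measures(a, b) for b in samples] for a in samples])
```

Both Gram matrices should have nonnegative eigenvalues up to rounding error, which can be inspected with `np.linalg.eigvalsh`.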
Appendix A. Supplementary Results

For completeness, we present the supplementary results that were used to prove the results
in this paper. The following result is quoted from Folland (1999, Theorem 8.14).



Theorem 25 Suppose $\phi \in L^1(\mathbb{R}^d)$, $\int \phi(x) \, dx = a$ and $\phi_t(x) = t^{-d} \phi(t^{-1} x)$ for $t > 0$. If $f$
is bounded and uniformly continuous on $\mathbb{R}^d$, then $f * \phi_t \to af$ uniformly as $t \to 0$.
By imposing slightly stronger conditions on $\phi$, the following result quoted from Folland
(1999, Theorem 8.15) shows that $f * \phi_t \to af$ almost everywhere for $f \in L^r(\mathbb{R}^d)$.



Theorem 26 Suppose $|\phi(x)| \le C(1 + \|x\|_2)^{-d-\epsilon}$ for some $C, \epsilon > 0$, and $\int \phi(x) \, dx = a$. If
$f \in L^r(\mathbb{R}^d)$ $(1 \le r \le \infty)$, then $f * \phi_t(x) \to af(x)$ as $t \to 0$ for every $x$ in the Lebesgue set
of $f$; in particular, for almost every $x$, and for every $x$ at which $f$ is continuous.

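Theorem 27 (Fourier transform of a measure) Let $\mu$ be a finite Borel measure on $\mathbb{R}^d$.
The Fourier transform of $\mu$ is given by

$$\hat{\mu}(\omega) = \int_{\mathbb{R}^d} e^{-i \omega^T x} \, d\mu(x), \quad \omega \in \mathbb{R}^d, \tag{72}$$

which is a bounded, uniformly continuous function on $\mathbb{R}^d$. In addition, $\hat{\mu}$ satisfies the
following properties:

(i) $\overline{\hat{\mu}(\omega)} = \hat{\mu}(-\omega)$, $\forall \omega \in \mathbb{R}^d$, i.e., $\hat{\mu}$ is conjugate symmetric,

(ii) $\hat{\mu}(0) = \mu(\mathbb{R}^d)$; in particular, $\hat{\mu}(0) = 1$ when $\mu$ is a probability measure.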
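Both properties are easy to verify for an empirical probability measure, for which (72) becomes a finite average. The sketch below is a minimal illustration; the function name `fourier_transform_empirical` and the random sample are assumptions made here.

```python
import numpy as np

def fourier_transform_empirical(points, omega):
    """hat{mu}(omega) = (1/n) sum_j exp(-i omega^T x_j) for the empirical measure of `points`."""
    return complex(np.mean(np.exp(-1j * (points @ omega))))

rng = np.random.default_rng(0)
points = rng.normal(size=(100, 3))         # atoms x_1..x_100 in R^3, each with mass 1/100
omega = np.array([0.3, -1.2, 0.7])

mu_hat = fourier_transform_empirical(points, omega)
mu_hat_neg = fourier_transform_empirical(points, -omega)
mu_hat_zero = fourier_transform_empirical(points, np.zeros(3))
```

Here `mu_hat_neg` should equal the complex conjugate of `mu_hat`, and `mu_hat_zero` should equal the total mass, which is 1 for a probability measure.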



The following result, called the Riemann-Lebesgue lemma, is quoted from Rudin (1991,
Theorem 7.5).

Lemma 28 (Riemann-Lebesgue) If $f \in L^1(\mathbb{R}^d)$, then $\hat{f} \in C_0(\mathbb{R}^d)$, and $\|\hat{f}\|_\infty \le \|f\|_1$.

The following theorem is a version of the Paley-Wiener theorem for distributions, and is
proved in Rudin (1991, Theorem 7.23).

Theorem 29 (Paley-Wiener) If $f \in \mathscr{D}'_d$ has compact support, then $\hat{f}$ is entire.



The following lemma provides a property of entire functions, which is quoted from Rudin
(1991, Lemma 7.21).

Lemma 30 If $f$ is an entire function in $\mathbb{C}^d$ that vanishes on $\mathbb{R}^d$, then $f = 0$.